Abstract

Vehicles are essential to how society moves, and one of the main ways of obtaining a vehicle is purchasing a previously owned one. Belarus, like many countries, has an extensive used vehicle market, which will be the subject of this study. A popular used cars catalog from Belarus will be cleaned and analyzed to answer several questions and accomplish our two main goals. Multiple Linear Regression, SVR, Decision Tree Regression, Random Forest Regression, and KNN will all be used to predict the market price of a used car in Belarus given the attributes found in the dataset. In addition, a Decision Tree and a Logistic Regression model will be used to create exploratory and predictive models for the exchange preferences of a vehicle owner in Belarus. The accuracy of these models will be analyzed, and their performance tested for real-world application with the help of training and testing datasets.

OUR FINDINGS: We found that Random Forest Regression had the best R-Squared value, the lowest error rate, and the lowest RMSE, which tells us it is the best of all the models tested for making predictions from our dataset. The two worst prediction models were Multiple Linear Regression and SVR Radial. Below is a chart of the results from each of our models. Note that a lower error rate is better, and a higher R-Squared value indicates a better fit between the model and the data.

Machine-Learning Method          RMSE      Error Rate  R-Squared
Multiple Linear Regression       4470.274  67.57975    0.5190641
SVR Linear                       3257.887  48.86921    0.7772176
SVR Radial                       4752.231  71.28478    0.5937837
Decision Tree Regression         3245.413  48.68209    0.7529956
Random Forest Tree Regression    1879.884  28.19878    0.9184761
KNN (K Nearest Neighbor)         3693.127  55.39793    0.6802895

Lastly, we employed two different models to test whether a vehicle owner was willing to exchange their vehicle: a Decision Tree and Logistic Classification. Both models performed very similarly; however, the Decision Tree was determined to perform slightly better as an exploratory and predictive model. The results of the models are as follows:

Model                                            Accuracy   Error      Sensitivity  Specificity  Pos. Predictive Value  AUC
Decision Tree Classification: Exploratory Model  0.6939802  0.3060198  0.6349       0.7064       0.3116                 0.6571428
Logistic Classification: Exploratory Model       0.6624371  0.3375629  0.58457      0.67010      0.14850                0.6505955
Decision Tree Classification: Prediction Model   0.6871661  0.3128339  0.6062       0.7039       0.2980                 0.6525616
Logistic Classification: Prediction Model        0.6650163  0.3349837  0.58041      0.67329      0.14808                0.6480223

The AUC, accuracy, sensitivity, and specificity are all higher for the Decision Tree. These values allow us to conclude that the Decision Tree produces the better predictive model. Nevertheless, both models produce good predictive and exploratory models of whether a vehicle owner is willing to exchange a vehicle.
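As a sketch of how these classification metrics are derived from a confusion matrix (the counts below are illustrative, not the report's actual counts):

```r
# Hypothetical confusion-matrix counts for "is the car exchangeable?" (illustration only)
TP <- 60; FN <- 35   # exchangeable cars predicted correctly / incorrectly
TN <- 170; FP <- 70  # non-exchangeable cars predicted correctly / incorrectly

accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # overall proportion correct
error       <- 1 - accuracy                     # overall proportion wrong
sensitivity <- TP / (TP + FN)                   # true positive rate
specificity <- TN / (TN + FP)                   # true negative rate
ppv         <- TP / (TP + FP)                   # positive predictive value
```

AUC is not a single confusion-matrix ratio; it is computed from the full ROC curve (e.g. with the pROC package), so it is omitted from this sketch.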

Introduction

Focus Problems

There are several objectives that will be the focal point of the research done on the used cars dataset. The main goal of this project is to create an effective predictive model for the price of a used vehicle in Belarus. An exploratory and predictive model will also be created for the likelihood of a vehicle owner being willing to exchange their vehicle. Several questions will be answered in the process of creating these models in order to gain a deeper understanding of the overall dataset.

Problem Motivations

The popularity of the used vehicle market and personal experience buying a vehicle led to our initial interest in the dataset. Furthermore, having a project member who was born in Eastern Europe added interest in how the Belarusian vehicle market differs from the used car market in America. Curiosity about which factors contribute the most to a vehicle's price, and the desire to successfully predict a vehicle's price given those factors, motivated the predictive models for price. Also, personal experience with vehicle owners offering exchanges motivated the exploratory and predictive model for an owner's willingness to offer an exchange.

Problem Solving Methodology

To answer our questions and accomplish our goals, we will use R to create graphs, and we will apply statistical analyses to confirm what the graphs suggest. The Chi-Square test, ANOVA, and correlation tests will be used extensively. Furthermore, the dataset will be split into training and testing sets to verify the accuracy of the models produced. Several models will be tested to achieve the most accurate result; their accuracy will be verified using RMSE, R-Squared, and error rate for the regression models, and accuracy, sensitivity, specificity, and AUC for the classification models.
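The regression metrics above can be sketched in a few lines of R. The vectors here are hypothetical, and "error rate" is assumed (consistent with the results table) to be RMSE expressed as a percentage of the mean observed price:

```r
# Hypothetical observed prices and model predictions (illustration only)
observed  <- c(2100, 4800, 8990, 6640, 3500)
predicted <- c(2500, 4600, 9500, 6000, 3300)

# Root Mean Squared Error: typical magnitude of a prediction error
rmse <- sqrt(mean((observed - predicted)^2))

# Error rate: RMSE as a percentage of the mean observed price (assumed definition)
error_rate <- rmse / mean(observed) * 100

# R-Squared: proportion of variance in the observed prices explained by the model
r_squared <- 1 - sum((observed - predicted)^2) / sum((observed - mean(observed))^2)
```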

Initial Questions

The main goal is to create a predictive model based on insights gained from analyzing the impact of each variable on the selling price of a vehicle. This goal was chosen for its applicability and for the general interest in which attributes impact the price of a used car. Secondly, exploratory and predictive models to gauge whether a vehicle would be exchangeable will be created using a Decision Tree and Logistic Classification. Both models will be analyzed for accuracy, sensitivity, and specificity. In addition, several questions concerning the dataset will be answered. These questions aid in reaching the main goal and yield important insights gleaned from the dataset. The questions that will be answered about the dataset are as follows:

Dataset

Dataset Source

Kaggle dataset: Used Cars Catalog.

The dataset was obtained from Kaggle and was scraped by Kirill Lepchenkov with the help of an associate, Vasily Kachalko, who aided in parsing the data. This dataset will be utilized to understand the effect certain characteristics have on the price at which cars in Belarus are sold. Analyzing cars that are for sale helps in understanding what contributes to a used car's price and allows for predicting the market price of a used car in Belarus.

Dataset Description

This dataset has 30 attributes and 38,531 samples. These attributes are a mixture of categorical (nominal) and numerical (interval, ratio) data. The attributes included in the dataset are Manufacturer Name, Model Name, Transmission Type, Color, Odometer Value, Year Produced, Engine Fuel, Engine Has Gas (Boolean), Engine Type, Engine Capacity, Body Type, Has Warranty (Boolean), (State) Condition of the Vehicle, Drivetrain Type, Price in USD, Exchangeability (Boolean), Location Region, Number of Photos, Up Counter, Features {0,1,2,3,4,5,6,7,8,9}, and Duration Listed.

Data Cleaning

To clean the dataset, it was vital to open the CSV file and investigate the types of attributes and samples given. To get a quick glance at the data, we first read the CSV file as follows:

cars <- read_csv("cars.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_logical(),
##   manufacturer_name = col_character(),
##   model_name = col_character(),
##   transmission = col_character(),
##   color = col_character(),
##   odometer_value = col_double(),
##   year_produced = col_double(),
##   engine_fuel = col_character(),
##   engine_type = col_character(),
##   engine_capacity = col_double(),
##   body_type = col_character(),
##   state = col_character(),
##   drivetrain = col_character(),
##   price_usd = col_double(),
##   location_region = col_character(),
##   number_of_photos = col_double(),
##   up_counter = col_double(),
##   duration_listed = col_double()
## )
## i Use `spec()` for the full column specifications.
head(cars, 20)
## # A tibble: 20 x 30
##    manufacturer_na~ model_name transmission color odometer_value year_produced
##    <chr>            <chr>      <chr>        <chr>          <dbl>         <dbl>
##  1 Subaru           Outback    automatic    silv~         190000          2010
##  2 Subaru           Outback    automatic    blue          290000          2002
##  3 Subaru           Forester   automatic    red           402000          2001
##  4 Subaru           Impreza    mechanical   blue           10000          1999
##  5 Subaru           Legacy     automatic    black         280000          2001
##  6 Subaru           Outback    automatic    silv~         132449          2011
##  7 Subaru           Forester   automatic    black         318280          1998
##  8 Subaru           Legacy     automatic    silv~         350000          2004
##  9 Subaru           Outback    automatic    grey          179000          2010
## 10 Subaru           Forester   automatic    silv~         571317          1999
## 11 Subaru           Forester   mechanical   other         280000          2003
## 12 Subaru           Tribeca    automatic    grey          256000          2008
## 13 Subaru           Forester   mechanical   other         321000          2002
## 14 Subaru           Justy      mechanical   red            49999          2001
## 15 Subaru           Outback    automatic    brown         154685          2011
## 16 Subaru           Outback    automatic    black         163219          2004
## 17 Subaru           Outback    automatic    other         318650          2005
## 18 Subaru           Impreza    mechanical   blue          191000          2005
## 19 Subaru           Forester   automatic    silv~         179000          2014
## 20 Subaru           Forester   automatic    black         159000          2013
## # ... with 24 more variables: engine_fuel <chr>, engine_has_gas <lgl>,
## #   engine_type <chr>, engine_capacity <dbl>, body_type <chr>,
## #   has_warranty <lgl>, state <chr>, drivetrain <chr>, price_usd <dbl>,
## #   is_exchangeable <lgl>, location_region <chr>, number_of_photos <dbl>,
## #   up_counter <dbl>, feature_0 <lgl>, feature_1 <lgl>, feature_2 <lgl>,
## #   feature_3 <lgl>, feature_4 <lgl>, feature_5 <lgl>, feature_6 <lgl>,
## #   feature_7 <lgl>, feature_8 <lgl>, feature_9 <lgl>, duration_listed <dbl>

Afterwards, it was crucial to obtain a quick summary of the dataset to assess the steps needed to clean it. This was done with the following line of code:

summary(cars)
##  manufacturer_name   model_name        transmission          color          
##  Length:38531       Length:38531       Length:38531       Length:38531      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  odometer_value    year_produced  engine_fuel        engine_has_gas 
##  Min.   :      0   Min.   :1942   Length:38531       Mode :logical  
##  1st Qu.: 158000   1st Qu.:1998   Class :character   FALSE:37184    
##  Median : 250000   Median :2003   Mode  :character   TRUE :1347     
##  Mean   : 248865   Mean   :2003                                     
##  3rd Qu.: 325000   3rd Qu.:2009                                     
##  Max.   :1000000   Max.   :2019                                     
##                                                                     
##  engine_type        engine_capacity  body_type         has_warranty   
##  Length:38531       Min.   :0.200   Length:38531       Mode :logical  
##  Class :character   1st Qu.:1.600   Class :character   FALSE:38082    
##  Mode  :character   Median :2.000   Mode  :character   TRUE :449      
##                     Mean   :2.055                                     
##                     3rd Qu.:2.300                                     
##                     Max.   :8.000                                     
##                     NA's   :10                                        
##     state            drivetrain          price_usd     is_exchangeable
##  Length:38531       Length:38531       Min.   :    1   Mode :logical  
##  Class :character   Class :character   1st Qu.: 2100   FALSE:24945    
##  Mode  :character   Mode  :character   Median : 4800   TRUE :13586    
##                                        Mean   : 6640                  
##                                        3rd Qu.: 8990                  
##                                        Max.   :50000                  
##                                                                       
##  location_region    number_of_photos   up_counter      feature_0      
##  Length:38531       Min.   : 1.000   Min.   :   1.00   Mode :logical  
##  Class :character   1st Qu.: 5.000   1st Qu.:   2.00   FALSE:29725    
##  Mode  :character   Median : 8.000   Median :   5.00   TRUE :8806     
##                     Mean   : 9.649   Mean   :  16.31                  
##                     3rd Qu.:12.000   3rd Qu.:  16.00                  
##                     Max.   :86.000   Max.   :1861.00                  
##                                                                       
##  feature_1       feature_2       feature_3       feature_4      
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:15135     FALSE:29907     FALSE:27904     FALSE:29227    
##  TRUE :23396     TRUE :8624      TRUE :10627     TRUE :9304     
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##  feature_5       feature_6       feature_7       feature_8      
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:24811     FALSE:31943     FALSE:28369     FALSE:22528    
##  TRUE :13720     TRUE :6588      TRUE :10162     TRUE :16003    
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##  feature_9       duration_listed  
##  Mode :logical   Min.   :   0.00  
##  FALSE:16206     1st Qu.:  23.00  
##  TRUE :22325     Median :  59.00  
##                  Mean   :  80.58  
##                  3rd Qu.:  91.00  
##                  Max.   :2232.00  
## 

From the summary we clearly see that engine_has_gas, has_warranty, and state have very little variability, and for that reason they will be removed from the dataset. Furthermore, upon researching the feature attributes it was uncovered that this data had severe inconsistencies (according to the creator), so they will be removed as well. To accomplish this, we use dplyr's select() to remove these attributes.

cars_edited <- dplyr::select(cars, -8 & -(12:13) & -(20:29))
head(cars_edited, 20)
## # A tibble: 20 x 17
##    manufacturer_na~ model_name transmission color odometer_value year_produced
##    <chr>            <chr>      <chr>        <chr>          <dbl>         <dbl>
##  1 Subaru           Outback    automatic    silv~         190000          2010
##  2 Subaru           Outback    automatic    blue          290000          2002
##  3 Subaru           Forester   automatic    red           402000          2001
##  4 Subaru           Impreza    mechanical   blue           10000          1999
##  5 Subaru           Legacy     automatic    black         280000          2001
##  6 Subaru           Outback    automatic    silv~         132449          2011
##  7 Subaru           Forester   automatic    black         318280          1998
##  8 Subaru           Legacy     automatic    silv~         350000          2004
##  9 Subaru           Outback    automatic    grey          179000          2010
## 10 Subaru           Forester   automatic    silv~         571317          1999
## 11 Subaru           Forester   mechanical   other         280000          2003
## 12 Subaru           Tribeca    automatic    grey          256000          2008
## 13 Subaru           Forester   mechanical   other         321000          2002
## 14 Subaru           Justy      mechanical   red            49999          2001
## 15 Subaru           Outback    automatic    brown         154685          2011
## 16 Subaru           Outback    automatic    black         163219          2004
## 17 Subaru           Outback    automatic    other         318650          2005
## 18 Subaru           Impreza    mechanical   blue          191000          2005
## 19 Subaru           Forester   automatic    silv~         179000          2014
## 20 Subaru           Forester   automatic    black         159000          2013
## # ... with 11 more variables: engine_fuel <chr>, engine_type <chr>,
## #   engine_capacity <dbl>, body_type <chr>, drivetrain <chr>, price_usd <dbl>,
## #   is_exchangeable <lgl>, location_region <chr>, number_of_photos <dbl>,
## #   up_counter <dbl>, duration_listed <dbl>

Now, to proceed with cleaning and analysis we need a more robust knowledge of our dataset. For this reason, it is vital to understand the type of data for each attribute. Therefore, we create a table organizing all our attributes.

Column  Details           Value/(Units)                                                                                  Data Type
A       Manufacturer Name Name of vehicle manufacturer                                                                   Nominal
B       Model Name        Name of vehicle model                                                                          Nominal
C       Transmission      Type of the transmission                                                                       Nominal
D       Color             Vehicle body color                                                                             Nominal
E       Odometer Value    Vehicle odometer value in km                                                                   Continuous
F       Year Produced     The year the car was produced                                                                  Continuous
G       Engine Fuel       Fuel type of the engine                                                                        Nominal
I       Engine Type       Vehicle engine type                                                                            Nominal
J       Engine Capacity   The capacity of the engine in liters                                                           Nominal
K       Body Type         Type of the body (hatchback, sedan, etc.)                                                      Nominal
M       Drivetrain        Front/rear/all drivetrain                                                                      Nominal
N       Price USD         Price of the car as listed in the catalog, in USD                                              Continuous
O       Exchangeable      If True, the owner is ready to exchange this car for another with little or no added payment   Nominal
P       Location Region   The region in Belarus where the car is listed for sale                                         Nominal
Q       Number of Photos  Number of photos the car has                                                                   Continuous
R       Up Counter        How many times the listing has been upped in the catalog to raise its position                 Continuous
S       Duration Listed   Number of days the car is listed in the catalog                                                Continuous

From the summary we notice that there are 10 NAs in the attribute engine_capacity. Our understanding of vehicles tells us that engine_capacity behaves as a categorical datatype here, and as such the NAs are crucial to our understanding of the data: the NA values distinguish the electric vehicles in the dataset. For the sake of future analysis, it will be useful to replace the NA values with an arbitrary number; we choose -1. Thus, we do the following:

summary(cars_edited$engine_capacity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.200   1.600   2.000   2.055   2.300   8.000      10
cars_edited <- cars_edited %>% mutate(engine_capacity = coalesce(engine_capacity, -1))
summary(cars_edited$engine_capacity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -1.000   1.600   2.000   2.054   2.300   8.000

Once the unnecessary attributes were filtered out and the NAs were accounted for, it was important to investigate duplicate samples. The code for this procedure was as follows:

count(cars_edited)
## # A tibble: 1 x 1
##       n
##   <int>
## 1 38531
sum(duplicated(cars_edited))
## [1] 41
which(duplicated(cars_edited))
##  [1]  5768  6557  9997 11381 11705 12811 17339 18185 27961 31994 32018 32019
## [13] 32023 32026 32029 32035 32037 32054 32127 32129 32180 32181 32296 32297
## [25] 32299 32300 32302 32303 32305 32307 32308 32309 32311 32313 32314 32316
## [37] 32317 32318 32319 32322 32326
cars_edited <- cars_edited %>% distinct()
count(cars_edited)
## # A tibble: 1 x 1
##       n
##   <int>
## 1 38490

Forty-one duplicate entries were identified and removed from the dataset.

Dataset Outliers

To determine what to do with the outliers, we identified the outliers for each continuous variable and created a separate dataset without them. Since there is no reason to believe this data is faulty, we will leave the outliers in the dataset and proceed with the investigation. To gauge the impact of the outliers, we would have to run our analyses on both the dataset with the outliers and the dataset without them.
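The outlier identification described above can be sketched with the common 1.5 × IQR rule. This is a sketch only: the `iqr_bounds` helper and the toy `prices` vector are illustrative, and on the real data the same filter would be applied per continuous attribute of `cars_edited` (e.g. price_usd):

```r
# 1.5 * IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers
iqr_bounds <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  iqr <- q[2] - q[1]
  c(lower = q[1] - 1.5 * iqr, upper = q[2] + 1.5 * iqr)
}

# Toy example with one extreme value (hypothetical prices)
prices <- c(2100, 4800, 8990, 6640, 3500, 50000)
b <- iqr_bounds(prices)
no_outliers <- prices[prices >= b["lower"] & prices <= b["upper"]]  # drops 50000
```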

Dataset Training and Testing

To effectively analyze the real-world performance of the models, we split the dataset into a “Training” set and a “Testing” set. The “Training” set comprised a random selection of 80% of the dataset, and the “Testing” set contained the remaining 20% of the samples. The reason for the split was to assess how accurately the models could predict the price of a vehicle given a separate portion of the samples. By testing our models against data not in the training set, we can effectively see how well a model holds up on new vehicles whose prices we wish to predict. This was done as follows:

set.seed(123)
training.samples <- cars_edited$manufacturer_name %>% createDataPartition(p = 0.8, list = FALSE)
train.data <- cars_edited[training.samples,]
## Warning: The `i` argument of ``[`()` can't be a matrix as of tibble 3.0.0.
## Convert to a vector.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
test.data <- cars_edited[-training.samples,]

Data Analysis and Modeling

Visualizations used to analyze data

To gain a robust knowledge of our dataset, several visualizations were used, in two stages. Initially, each attribute was graphed on its own to gain a general understanding of how the samples are distributed for that attribute. This was done with the help of bar graphs, histograms, boxplots, the count function, and pie graphs.

# 1) What is the distribution of manufacturers?
ggplot(cars_edited, aes(y = manufacturer_name)) + geom_bar(aes(fill = manufacturer_name)) + geom_text(stat = "count", aes(label = after_stat(count)), hjust = 1)

We can see a large difference in the number of cars for each manufacturer. Volkswagen, Opel, BMW, Audi, AvtoVAZ, Ford, Renault, and Mercedes-Benz are the major manufacturers.

# 2) A table to show unique car model names and quantity
cars_edited %>% count(model_name)
## # A tibble: 1,118 x 2
##    model_name     n
##  * <chr>      <int>
##  1 <U+0410>21            8
##  2 <U+0410>22            3
##  3 <U+041C>20            6
##  4 <U+041C>5             4
##  5 100          371
##  6 1007           6
##  7 100NX          4
##  8 106           14
##  9 107           12
## 10 11             2
## # ... with 1,108 more rows
# 3) Plotting the number of cars with automatic or mechanical transmissions
transmissionGrouped <- group_by(cars_edited, transmission)
transmissionCounted <- count(transmissionGrouped)
percentTransmission <- paste0(round(100*transmissionCounted$n/sum(transmissionCounted$n), 2), "%")
pie(transmissionCounted$n, labels = percentTransmission, main = "Transmission Distribution", col = rainbow(nrow(transmissionCounted)))
legend("right", c("Automatic", "Mechanical"), cex = 0.8,
       fill = rainbow(nrow(transmissionCounted)))

Mechanical transmissions are significantly more common than automatic. This will be an attribute to consider in our final model.

# 4) Plotting cars by color and quantity
ggplot(cars_edited, aes(x = color)) + geom_bar(stat = "count", aes(fill = color)) + geom_text(stat = "count", aes(label = after_stat(count)), vjust = -1)

There is a lot of diversity in the colors; although some categories have more values than others, there is still a decent amount of variation.

# 5) Histogram Odometer Value: Graph to see how the data is skewed
ggplot(cars_edited, aes(odometer_value)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The data is right-skewed, with what appear to be outliers at around 1,000,000 km.

# 6) Histogram Year produced: Graph to see how the data is skewed
ggplot(cars_edited) + geom_histogram(aes(year_produced))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The distribution appears to be roughly normal, aside from what appear to be some outliers in the earlier years.

# 7) Graph to show the fuel distribution
ggplot(cars_edited, aes(x = engine_fuel)) + geom_bar(stat = "count", aes(fill = engine_fuel)) + geom_text(stat = "count", aes(label = after_stat(count)), vjust = -1)

There are only two major engine fuels (gasoline and diesel).

# 8) Pie graph to show engine type Distribution (Electric, Diesel, Gasoline)
TypeGrouped <- group_by(cars_edited, engine_type)
TypeCounted <- count(TypeGrouped) 
percentType <- paste0(round(100*TypeCounted$n/sum(TypeCounted$n), 2), "%")
pie(TypeCounted$n, labels = percentType, main = "Engine Type Distribution", col = rainbow(nrow(TypeCounted)))
legend("right", c("diesel", "electric", "gasoline"), cex = 0.8,
       fill = rainbow(nrow(TypeCounted)))

Not surprisingly, gasoline and diesel are the two most common engine types, considering the fuel distribution.

# 9) Table for Engine capacity
cars_edited %>% count(engine_capacity)
## # A tibble: 62 x 2
##    engine_capacity     n
##  *           <dbl> <int>
##  1            -1      10
##  2             0.2     6
##  3             0.5     1
##  4             0.8    53
##  5             0.9    17
##  6             1     274
##  7             1.1   163
##  8             1.2   563
##  9             1.3   875
## 10             1.4  2393
## # ... with 52 more rows

Engine capacity seems to be right-skewed, which may indicate outliers.

# 10) Bar graph Body type: count how many cars have the same body type
ggplot(cars_edited, aes(x = body_type), stat = "count") + geom_bar() + geom_text(stat = "count", aes(label = after_stat(count)), vjust = -1)

There is some diversity in body type and the diversity in categories may lend itself to useful data for a future model

# 11) Graph Drivetrain distribution:
drivetrainGrouped <- group_by(cars_edited, drivetrain)
drivetrainCounted <- count(drivetrainGrouped) 
percentdrivetrain <- paste0(round(100*drivetrainCounted$n/sum(drivetrainCounted$n), 2), "%")
pie(drivetrainCounted$n, labels = percentdrivetrain, main = "Drivetrain Distribution", col = rainbow(nrow(drivetrainCounted)))
legend("right", c("all", "front", "rear"), cex = 0.8,
       fill = rainbow(nrow(drivetrainCounted)))

Although most vehicles are front-wheel drive, there are enough all- and rear-wheel-drive vehicles to gather some promising insights.

# 12) Number of cars with same price
ggplot(cars_edited, aes(x = price_usd), stat = "count") + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This graph is extremely right-skewed. Given how useful price is in our dataset, it will be our main response attribute.

# 13) Pie Graph exchangeability Distribution
exchangeableGrouped <- group_by(cars_edited, is_exchangeable)
exchangeableCounted <- count(exchangeableGrouped) 
percentexchangeable <- paste0(round(100*exchangeableCounted$n/sum(exchangeableCounted$n), 2), "%")
pie(exchangeableCounted$n, labels = percentexchangeable, main = "Exchangeability Distribution", col = rainbow(nrow(exchangeableCounted)))
legend("right", c("False", "True"), cex = 0.8,
       fill = rainbow(nrow(exchangeableCounted)))

Exchangeability is more common than anticipated. It will be interesting to see whether pricier or cheaper cars are more likely to be offered for exchange.

# 14) Pie Graph Location region: Count the number of cars in a region
regionPriceDF <- group_by(cars_edited, location_region)
regionPriceDFCount <- count(regionPriceDF)
percentRegion <- paste0(round(100*regionPriceDFCount$n/sum(regionPriceDFCount$n), 2), "%")
pie(regionPriceDFCount$n, labels = percentRegion, main = "Region Price Distribution", col = rainbow(nrow(regionPriceDFCount)))
legend("right", c("Brest Region", "Gomel Region", "Grodno Region", "Minsk Region", "Mogilev Region", "Vitebsk Region"), cex = 0.8,
       fill = rainbow(nrow(regionPriceDFCount)))

Minsk accounts for a very large number of vehicles (which makes sense considering its population), with a fairly even distribution across the other regions. This attribute may be less useful since Minsk makes up such a large portion of the data.

# 15) Histogram Number of photos: Graph to see how the data is skewed
ggplot(cars_edited) + geom_histogram(mapping = aes(number_of_photos))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The graph is very right-skewed. We suspect photos may increase the value of a vehicle, but tests will need to be done to affirm this.

# 16) Box plot Number of photos: Graph to see how the data is skewed
ggplot(cars_edited) + geom_boxplot(mapping = aes(number_of_photos))

There are many outliers. With extra time we may be able to investigate the impact of these outliers on the data.

# 17) Histogram Up counter: investigating how our outliers look with our modifications
ggplot(cars_edited) + geom_histogram(mapping = aes(up_counter))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The large scale of the axis makes it clear that extreme outliers exist.

# 18) Box plot Duration listed: investigating how our outliers look with our modifications
ggplot(cars_edited) + geom_boxplot(mapping = aes(duration_listed))

There is a significant number of outliers, but there is no evidence to conclude they should be eliminated.

# 19) Histogram Duration listed: Graph to see how the data is skewed
ggplot(cars_edited) + geom_histogram(aes(duration_listed))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The histogram shows that duration listed is right-skewed.

Once a general understanding of each attribute was gained, we began to visualize various combinations of attributes. These visualizations were vital for seeing different relationships among our samples. The visualizations used were balloon plots, scatterplots, frequency polygons, dplyr::summarize tables, and boxplots.

# 1) Graph to show the number of cars(by manufacturer name) in a region BALLOON PLOT
ggplot(cars_edited, aes(location_region, manufacturer_name)) + geom_count()

Due to the number of categories, a statistical test will be needed to draw significant conclusions.

# 2) Graph to show the price of a car according to its year produced SCATTER PLOT
ggplot(cars_edited, aes(year_produced, price_usd)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

There appears to be a roughly parabolic relationship between year_produced and price_usd.

# 3) Graph to show the number of cars in specific colors(10 red cars, 8 blue cars etc.) by region BAR GRAPH
ggplot(cars_edited, aes(color)) + geom_bar(aes(fill = location_region))

From the bar graph there does not seem to be any significant difference in color distribution across locations.

# 4) Graph to show the price of a car according to its mileage (odometer) SCATTER PLOT
ggplot(cars_edited, aes(odometer_value, price_usd)) + geom_point(aes(color = is_exchangeable)) + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

This graph is incredibly diverse and indicates the need for more advanced models to assess price relationships.

# 5)Graph to show the price of a car according to its year produced AND body type SCATTER PLOT
ggplot(cars_edited, aes(year_produced, price_usd)) + geom_point(aes(color = body_type)) + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

The quantity of samples makes this graph hard to dissect. One would have to perform tests to see how body type relates to price.

# 6) Group by car body type and get its mean price
group_by(cars_edited, body_type) %>% summarise(price_mean = mean(price_usd))
## # A tibble: 12 x 2
##    body_type price_mean
##  * <chr>          <dbl>
##  1 cabriolet     10976.
##  2 coupe          7458.
##  3 hatchback      4036.
##  4 liftback       7873.
##  5 limousine      8154.
##  6 minibus        8466.
##  7 minivan        6131.
##  8 pickup        11748.
##  9 sedan          5782.
## 10 suv           13768.
## 11 universal      5017.
## 12 van            6675.
# 7) Graph to show the outliers with body type and price BOX PLOT
ggplot(cars_edited) + geom_boxplot(mapping = aes(x = reorder(body_type, price_usd), y =
                                                   price_usd))

This graph suggests that, given more time, it may be useful to test our models with outliers removed.

# 8) Graph to show the correlation between car body type, price, AND engine fuel
ggplot(cars_edited) + geom_point(mapping = aes(x = body_type, y = price_usd, color = engine_fuel))

This graph tells us that certain body types tend to come with engines for specific fuels. The range for prices seems to be very scattered.

# 9) Graph to show the price of a car according to its number of photos incl. engine fuel SCATTER PLOT
ggplot(cars_edited) + geom_point(mapping = aes(x = number_of_photos, y = price_usd, color = engine_fuel))

This graph leads us to believe that there may well be no correlation between price and number of photos. Further investigation will be needed to confirm this suspicion.

# 10) Group cars by manufacturer, and get its mean price
cars_edited %>% group_by(manufacturer_name) %>% summarize(mean(price_usd))
## # A tibble: 55 x 2
##    manufacturer_name `mean(price_usd)`
##  * <chr>                         <dbl>
##  1 Acura                        12773.
##  2 Alfa Romeo                    2689.
##  3 Audi                          7155.
##  4 AvtoVAZ                       1519.
##  5 BMW                           9552.
##  6 Buick                        12876.
##  7 Cadillac                     11093.
##  8 Chery                         4546.
##  9 Chevrolet                     8873.
## 10 Chrysler                      4995.
## # ... with 45 more rows

To find an optimal model to predict whether a vehicle is exchangeable we utilized both a Decision Tree and Logistic Regression.

Logistic Regression: This is a very popular and powerful classification technique. Since we are dealing with a binary response variable, logistic regression is a natural model to use.

Decision Tree: Although logistic regression is a powerful model, we also ran a Decision Tree to see whether it would produce a more accurate model. Due to its robust and simple nature, the decision tree was an important candidate for predicting exchangeability: it should be able to handle any outliers in our data and still provide an effective predictive model.
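As a minimal sketch of these two classifiers, the snippet below fits both on a small synthetic stand-in for cars_edited. The column names mirror the dataset, but the data and the exchange rule are fabricated for illustration; this is not the project's actual code.

```r
# Synthetic stand-in data (fabricated; only the column names match the dataset)
set.seed(123)
toy <- data.frame(
  price_usd      = runif(500, 500, 20000),
  odometer_value = runif(500, 0, 500000)
)
# Fabricated rule: cheap, high-mileage cars are listed as exchangeable
toy$is_exchangeable <- factor(toy$price_usd < 8000 & toy$odometer_value > 150000)

# Logistic regression: binomial family for the binary response
logit_fit  <- glm(is_exchangeable ~ price_usd + odometer_value,
                  data = toy, family = binomial)
logit_pred <- ifelse(predict(logit_fit, type = "response") > 0.5, "TRUE", "FALSE")

# Decision tree via rpart (a recommended package shipped with R)
library(rpart)
tree_fit  <- rpart(is_exchangeable ~ price_usd + odometer_value,
                   data = toy, method = "class")
tree_pred <- predict(tree_fit, type = "class")

# Training accuracy of each classifier
mean(logit_pred == toy$is_exchangeable)
mean(tree_pred  == toy$is_exchangeable)
```

On data generated by threshold rules like this, the tree can carve out the exact rectangle while the logistic model is limited to a linear boundary, which mirrors why the two models can disagree in accuracy.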

Several machine learning techniques were used to create the most accurate prediction model for our main goal. The models used in the project were as follows:

Multiple Linear Regression: This was used to gauge the impact of the continuous attributes on the price of the vehicle. The model is simple to understand and requires less computing power than more advanced models, which led us to use this ML method as our first model.
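The sketch below shows the shape of this first model on synthetic data with two of the dataset's continuous attributes; the data, the 70/30 split ratio, and the coefficients are illustrative assumptions rather than the project's actual values. It also computes the RMSE and error-rate (RMSE as a percentage of mean price) metrics used in our comparison table.

```r
# Synthetic data shaped like two continuous attributes (values are fabricated)
set.seed(42)
n <- 1000
toy <- data.frame(
  year_produced  = sample(1990:2019, n, replace = TRUE),
  odometer_value = runif(n, 0, 400000)
)
toy$price_usd <- 300 * (toy$year_produced - 1990) -
  0.01 * toy$odometer_value + rnorm(n, sd = 1000)

# Train/test split (a 70/30 ratio is assumed here for illustration)
idx   <- sample(n, 0.7 * n)
train <- toy[idx, ]
test  <- toy[-idx, ]

# Multiple linear regression on the continuous attributes
fit  <- lm(price_usd ~ year_produced + odometer_value, data = train)
pred <- predict(fit, newdata = test)

# RMSE, and error rate defined as RMSE as a percentage of the mean price
rmse       <- sqrt(mean((test$price_usd - pred)^2))
error_rate <- rmse * 100 / mean(test$price_usd)
c(rmse = rmse, error_rate = error_rate)
```

Because the synthetic prices are linear in the predictors plus noise with sd = 1000, the test RMSE lands near 1000, which is the best any model can do here.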

SVR: Used to see whether a single model can handle every attribute and weight each one according to its impact on price. Although this model requires much more computing power, we believed its robust nature would lead to a high level of accuracy. This model's effectiveness with categorical data, and its ability to fit both linear and non-linear boundaries, make it a prime candidate for our dataset. Furthermore, to optimize our SVR model we used linear, polynomial, and radial kernel transformations and compared the results of the models we generated.
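A sketch of the kernel comparison is below, assuming the e1071 package (a common choice for SVR in R; the project's actual implementation may differ). The toy data is a noisy sine wave, chosen only to make the linear/non-linear contrast visible.

```r
# SVR with three kernels on a fabricated non-linear pattern
library(e1071)
set.seed(1)
toy <- data.frame(x = runif(200, 0, 10))
toy$y <- sin(toy$x) + rnorm(200, sd = 0.1)

kernels <- c("linear", "polynomial", "radial")
rmse <- sapply(kernels, function(k) {
  fit <- svm(y ~ x, data = toy, kernel = k)   # fit one SVR per kernel
  sqrt(mean((toy$y - predict(fit, toy))^2))   # training RMSE
})
round(rmse, 3)
```

On data like this, the radial kernel fits the curve far better than the linear kernel, which illustrates why trying several kernels and comparing their error is worthwhile.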

Decision Tree (Partition): Used for predictive modeling. Since it is incredibly robust and relies on very few assumptions, we believed it would handle possible outliers in our data and cope with the size of our dataset to produce an optimal model. Its simpler nature would also make it preferable to more complex models (such as Random Forest Regression or KNN) at comparable accuracy.
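A minimal regression-tree sketch follows, using rpart with `method = "anova"` for a continuous response (rpart is a standard partition-tree choice; the data here is synthetic, not the cars dataset).

```r
# Regression tree on fabricated year/price data
library(rpart)
set.seed(7)
toy <- data.frame(year_produced = sample(1995:2019, 800, replace = TRUE))
toy$price_usd <- 400 * (toy$year_produced - 1995) + rnorm(800, sd = 500)

# method = "anova" requests a regression (continuous-response) tree
fit  <- rpart(price_usd ~ year_produced, data = toy, method = "anova")
pred <- predict(fit, toy)
sqrt(mean((toy$price_usd - pred)^2))   # training RMSE
```

The tree partitions year_produced into intervals and predicts each interval's mean price, so its RMSE falls well below the raw spread of prices.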

Random Forest Regression: Although Random Forest Regression is more complicated than a Decision Tree, since it leverages many decision trees, it is still a crucial model for our dataset. We used it to see whether we could create an even more accurate model. If the accuracy turns out to be only marginally better, the single Decision Tree may still be preferred since it is cheaper to compute. Nevertheless, Random Forest Regression is a wonderful tool for building a predictive model and worth testing for the sake of optimization.
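The core bagging idea behind a random forest can be sketched with bootstrap-aggregated rpart trees, as below. This is only an illustration on fabricated data: a real random forest also subsamples candidate features at each split, and the project presumably used a dedicated implementation such as the randomForest package.

```r
# Bagged regression trees: the averaging idea behind a random forest
library(rpart)
set.seed(11)
toy <- data.frame(year_produced = sample(1995:2019, 600, replace = TRUE))
toy$price_usd <- 400 * (toy$year_produced - 1995) + rnorm(600, sd = 500)

n_trees <- 50
preds <- sapply(1:n_trees, function(i) {
  boot <- toy[sample(nrow(toy), replace = TRUE), ]  # bootstrap sample
  fit  <- rpart(price_usd ~ year_produced, data = boot, method = "anova")
  predict(fit, toy)                                 # each tree's predictions
})
ensemble_pred <- rowMeans(preds)                    # average over all trees
sqrt(mean((toy$price_usd - ensemble_pred)^2))       # training RMSE
```

Averaging many trees grown on resampled data smooths out the variance of any single tree, which is the mechanism behind the forest's lower error in our results table.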

KNN: K Nearest Neighbor was chosen for its popularity and simplicity. KNN can handle our categorical as well as continuous attributes, which makes it a reasonable model for predicting price. Comparing it against the other models will show how useful it is for our dataset.
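To make the KNN-regression idea concrete, here is a few-line base-R version for a single numeric predictor (the project itself would have used an existing KNN implementation; the function name and data below are illustrative).

```r
# Hypothetical helper: predict each query as the mean response of its
# k nearest training points (1-D Euclidean distance)
knn_predict <- function(train_x, train_y, new_x, k = 5) {
  sapply(new_x, function(q) {
    nearest <- order(abs(train_x - q))[1:k]  # indices of the k closest points
    mean(train_y[nearest])                   # average their responses
  })
}

# Fabricated data: y is roughly 2x plus noise
set.seed(99)
x <- runif(300, 0, 10)
y <- 2 * x + rnorm(300, sd = 0.5)
pred <- knn_predict(x, y, new_x = c(2, 5, 8), k = 5)
round(pred, 1)
```

With 300 points on [0, 10], the 5 nearest neighbors of each query sit very close to it, so the predictions track 2x closely; choosing k trades off this locality against noise-smoothing.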

Final outcomes and Analysis

Questions and Answer Justifications

Q1: What impact does a region have on price?

A: Although the region does impact price, the extent of this impact must be assessed in a model that includes more attributes. The reason is that correlation does not imply causation. In other words, more attributes may be at play, and to better understand the impact of region on price it will be vital to assess the role region plays in the overall models.

Justification:

To answer this question, we decided to first use a bar chart to see the average prices per region in comparison with each other.

Looking at the graph, we can predict that region may play a part in price. We derive this prediction from the observation that, although most regions have similar average vehicle prices, Minsk has a significantly higher average. However, we cannot draw a conclusion from the graph alone; we need to confirm it with an appropriate test. The test used to check whether region has a significant impact on price was the One-Way Anova test. Anova was chosen because the region attribute consists of several categories and we wished to see its impact on a single continuous variable.

Prior to doing the One-Way Anova test we first performed some data transformation to gain a better understanding of the differences between the region prices.

group_by(regionPriceDF, regionPriceDF$location_region) %>%
  summarise(
    count = n(),
    mean = mean(price_usd, na.rm = TRUE),
    sd = sd(price_usd, na.rm = TRUE)
  )
## # A tibble: 6 x 4
##   `regionPriceDF$location_region` count  mean    sd
## * <chr>                           <int> <dbl> <dbl>
## 1 Brest Region                     2989 5091. 4652.
## 2 Gomel Region                     3140 5022. 4603.
## 3 Grodno Region                    2485 4745. 4223.
## 4 Minsk Region                    24193 7668. 7117.
## 5 Mogilev Region                   2678 4622. 4654.
## 6 Vitebsk Region                   3005 4870. 4450.

Once again, quick inspection tells us that Minsk has a noticeably different mean price and a much larger count. To affirm our suspicions from the graph and summary, we now perform the Anova test.

The hypotheses for the test are as follows:

H0 = The means of the different groups are the same

Ha = At least one sample mean is not equal to the others

Furthermore, we will use 0.05 as our significance level. The result of the Anova test was the following:

# Compute the analysis of variance
res.aov <- aov(regionPriceDF$price_usd ~ regionPriceDF$location_region,
               data = regionPriceDF)
# Summary of the analysis
summary(res.aov)
##                                  Df    Sum Sq   Mean Sq F value Pr(>F)    
## regionPriceDF$location_region     5 7.020e+10 1.404e+10   355.9 <2e-16 ***
## Residuals                     38484 1.518e+12 3.945e+07                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Looking at the results of the one-way Anova, we can confirm our prediction. The p-value was less than 0.05 (<2e-16), so we conclude that there are significant differences between the regions.

We continue our investigation of the problem at hand by using Tukey HSD to do multiple pairwise-comparisons between the means of our groups.

# Tukey Test
TukeyHSD(res.aov)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = regionPriceDF$price_usd ~ regionPriceDF$location_region, data = regionPriceDF)
## 
## $`regionPriceDF$location_region`
##                                      diff        lwr          upr     p adj
## Gomel Region-Brest Region       -69.35261  -526.7479   388.042672 0.9980976
## Grodno Region-Brest Region     -346.16487  -832.0693   139.739598 0.3250300
## Minsk Region-Brest Region      2576.92905  2229.9066  2923.951471 0.0000000
## Mogilev Region-Brest Region    -468.67476  -944.9226     7.573076 0.0567405
## Vitebsk Region-Brest Region    -220.94393  -683.3226   241.434780 0.7500113
## Grodno Region-Gomel Region     -276.81226  -757.3836   203.759111 0.5708661
## Minsk Region-Gomel Region      2646.28166  2306.7669  2985.796388 0.0000000
## Mogilev Region-Gomel Region    -399.32215  -870.1275    71.483215 0.1502298
## Vitebsk Region-Gomel Region    -151.59132  -608.3623   305.179697 0.9345414
## Minsk Region-Grodno Region     2923.09392  2546.0489  3300.138960 0.0000000
## Mogilev Region-Grodno Region   -122.50989  -621.0582   376.038406 0.9819607
## Vitebsk Region-Grodno Region    125.22095  -360.0959   610.537820 0.9775888
## Mogilev Region-Minsk Region   -3045.60381 -3410.1197 -2681.087958 0.0000000
## Vitebsk Region-Minsk Region   -2797.87298 -3144.0722 -2451.673795 0.0000000
## Vitebsk Region-Mogilev Region   247.73083  -227.9175   723.379142 0.6744633

Analyzing the results, we conclude that the differences in average price between Minsk and every other region are statistically significant. Within this dataset, region does have an impact on vehicle price.


Q2: What is the distribution of manufacturers, and do manufacturers have a significant impact on the asking price of a vehicle?

A: We can confirm that there is a relationship between manufacturer and asking price. Manufacturer appears to be one of the larger factors in asking price but is not sufficient to predict price on its own.

Justification:

First a bar graph was used to effectively plot the distribution of vehicles for each manufacturer.

# 1)What is the distribution of manufacturers?
ggplot(cars_edited, aes(y = manufacturer_name)) + geom_bar(aes(fill = manufacturer_name)) + geom_text(stat='count', aes(label=..count..), hjust=1)

Once the distribution of manufacturers was plotted, we turned our attention to the second part of the question: whether manufacturers have a significant impact on the asking price of a vehicle.

Another bar graph was used as a means of quickly inspecting how average price ranged between different manufacturers.

manuPriceDF <- group_by(cars_edited, manufacturer_name)
manuPriceDF_averages <- summarise(manuPriceDF, average_price_usd = mean(price_usd))
ggplot(manuPriceDF_averages, aes(x = average_price_usd, y = manufacturer_name)) + geom_bar(aes(fill = manufacturer_name),stat="identity") + geom_text(aes(label =  paste0("$",round(average_price_usd)), hjust = 1))

Looking at the bar graph, we can predict that the manufacturer of a vehicle has an impact on asking price. Nevertheless, observation is not sufficient evidence, so we proceed by testing this claim. Once again we are dealing with categorical data and wish to compare the mean prices across manufacturers, so we naturally chose the One-Way Anova test.

The hypotheses for the test are as follows:

H0 = The mean prices of the different groups are the same

Ha = At least one sample mean price is not equal to the others

Furthermore, we will use 0.05 as our significance level. The result of the Anova test was the following:

manuSumm <- group_by(manuPriceDF, manuPriceDF$manufacturer_name) %>%
  summarise(
    count = n(),
    mean = mean(price_usd, na.rm = TRUE),
    sd = sd(price_usd, na.rm = TRUE)
  )

# Compute the analysis of variance
res.aovTwo <- aov(manuPriceDF$price_usd ~ manuPriceDF$manufacturer_name,
                  data = manuPriceDF)

summary(res.aovTwo)
##                                  Df    Sum Sq   Mean Sq F value Pr(>F)    
## manuPriceDF$manufacturer_name    54 2.917e+11 5.402e+09   160.1 <2e-16 ***
## Residuals                     38435 1.297e+12 3.374e+07                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Investigating the results of the one-way Anova, we can confirm our claim. The p-value was less than 0.05 (<2e-16), so we conclude that there are significant differences between the manufacturers.

We continue with the Tukey HSD to do multiple pairwise-comparisons between the means of our groups. This will allow us to see exactly which manufacturers significantly differ in asking price.

# Result omitted for brevity's sake
# Tukey Test
TukeyHSD(res.aovTwo)

From the results we see that numerous manufacturers are statistically different, and we can use this data to list every statistically different pair of manufacturers.

We can confirm that there is a relationship between the manufacturer and asking price.


Q3: What is the relationship between odometer and price?

A: There is a moderate negative correlation between price and odometer.

Justification:

Initially a Scatter plot was used to quickly inspect for possible relationships between price and odometer.

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

When we investigate this graph, we initially notice that the line of best fit seems to be going down as the odometer value increases. Nevertheless, the data had a significant amount of variation and further testing would need to be done to confirm our results.

Since we are investigating the relationship between 2 continuous variables, we begin by using a correlation test.

The hypotheses for the test are as follows:

H0 =There does not exist a correlation between Odometer Value and Vehicle Price

Ha = There does exist a correlation between Odometer Value and Vehicle Price

Furthermore, we will use 0.05 as our significance level. The result of the correlation test was the following:

#Getting cor value
cor.test(cars_edited$odometer_value, cars_edited$price_usd)
## 
##  Pearson's product-moment correlation
## 
## data:  cars_edited$odometer_value and cars_edited$price_usd
## t = -90.821, df = 38488, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4282966 -0.4118421
## sample estimates:
##        cor 
## -0.4201039

Since the p-value is less than 0.05, we can conclude that price and odometer are significantly correlated, with a correlation coefficient of -0.4201039 and a p-value of < 2.2e-16.

We continue our investigation by examining how good an indicator of price the odometer is. To do this we fit a linear regression and check its percentage error. Afterwards, we graph the regression line to provide a useful visual.

set.seed(123)
odometer_on_price <- lm (price_usd ~ odometer_value, data = cars_edited)

ggplot (cars_edited, aes(x=odometer_value, y=price_usd)) + geom_point() + stat_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'

Once the linear regression model is created and graphed, we proceed to check the R2 to see the proportion of the prices that can be explained by the model, the variability of the beta coefficients, and the percentage error.

summary(odometer_on_price)
## 
## Call:
## lm(formula = price_usd ~ odometer_value, data = cars_edited)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11577  -3514  -1122   2064  40854 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.158e+04  6.203e+01  186.65   <2e-16 ***
## odometer_value -1.986e-02  2.186e-04  -90.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5830 on 38488 degrees of freedom
## Multiple R-squared:  0.1765, Adjusted R-squared:  0.1765 
## F-statistic:  8248 on 1 and 38488 DF,  p-value: < 2.2e-16
confint(odometer_on_price)
##                        2.5 %        97.5 %
## (Intercept)     1.145654e+04  1.169971e+04
## odometer_value -2.028451e-02 -1.942748e-02

sigma(odometer_on_price)*100/mean(cars_edited$price_usd)
## [1] 87.89188

The R2 is 0.1765, which indicates that only a low proportion of the variation in price can be explained by the model. Furthermore, the percentage error is 87.89, which confirms how poor a model would be if it solely used odometer to predict price.

We can conclude that the higher the odometer the lower the price of the vehicle will be. Nevertheless, our linear regression model informs us that to create an accurate model we will need to consider more attributes.


Q4: Does the number of photos a vehicle has impact the selling price?

A: There exists a low positive correlation between the number of photos a vehicle has and the selling price.

Justification:

To gain an intuitive understanding of the question we sought to use a scatter plot to see the relationship between price and number of photos.

#Scatter plot: Number of photos and price
ggplot(cars_edited, aes( x =number_of_photos, y=price_usd)) + geom_hex() + stat_smooth(color = "red") 
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Looking at this graph, we can make out only a slight trend in the fitted line. The data has a significant amount of variation, and further testing is needed to get a result. It seems there will be very little (if any) correlation between price and number of photos.

Since we are investigating the relationship between 2 continuous variables, we will be using the correlation test.

The hypotheses are as follows:

H0 =There does not exist a correlation between Number of Photos and Vehicle Price

Ha = There does exist a correlation between Number of Photos and Vehicle Price

We will use 0.05 as our significance level. The result of the correlation test was the following:

#getting the cor value
cor.test(cars_edited$number_of_photos, cars_edited$price_usd)
## 
##  Pearson's product-moment correlation
## 
## data:  cars_edited$number_of_photos and cars_edited$price_usd
## t = 65.382, df = 38488, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3071525 0.3251358
## sample estimates:
##       cor 
## 0.3161726

Since the p-value is less than 0.05, we can conclude that price and number of photos are significantly correlated, with a correlation coefficient of 0.3161726 and a p-value of < 2.2e-16.

Next, we aim to investigate how good an indicator of price the number of vehicle photos is. We fit a linear regression, check its percentage error, and graph the regression line to provide a useful visual.

#Getting the formula for linear regression
set.seed(123)
number_of_photos_on_price <- lm (price_usd ~ number_of_photos, data = cars_edited)

#Scatter plot: Number of photos and price with linear regression line
ggplot (cars_edited, aes(x=number_of_photos, y=price_usd)) + geom_point() + stat_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'

Although the linear regression model is created and graphed, there remains information to be gleaned. We proceed to check the R2 to see the proportion of price variation explained by the model, the variability of the beta coefficients, and the percentage error.

summary(number_of_photos_on_price)
## 
## Call:
## lm(formula = price_usd ~ number_of_photos, data = cars_edited)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24082  -3884  -1585   2249  44082 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3418.275     58.159   58.77   <2e-16 ***
## number_of_photos  333.297      5.098   65.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6095 on 38488 degrees of freedom
## Multiple R-squared:  0.09997,    Adjusted R-squared:  0.09994 
## F-statistic:  4275 on 1 and 38488 DF,  p-value: < 2.2e-16
# R^2 is very low (0.09997) which tells us that number of photos is not a good indicator of price.
# We can suspect that several more variables are in play.

confint(number_of_photos_on_price)
##                     2.5 %    97.5 %
## (Intercept)      3304.283 3532.2680
## number_of_photos  323.305  343.2882
sigma(number_of_photos_on_price)*100/mean(cars_edited$price_usd)
## [1] 91.88472
# Our prediction error rate is extremely high (91.88472%) which explains the low correlation

The R2 is 0.09997, which indicates that a low proportion of the variation in price can be explained by the model. In other words, number of photos is not a good indicator of price on its own (more attributes are needed in a model). Also, the percentage error is 91.88, which is very high. This confirms how poor a model would be if it solely used number of photos to predict price.

We can conclude there is a low positive correlation between price and number of photos. Our linear regression model tells us that to create an accurate model we will need to consider more attributes.


Q5: Does the number of times a vehicle has been upped in the catalog to raise its position impact the selling price?

A: The number of times a vehicle has been upped has a negligible impact on the selling price.

Justification:

To begin answering this question it was natural to use scatterplot (using 2 continuous attributes) to gain an intuitive understanding of any possible relationship.

# Regression analysis
ggplot(cars_edited, aes( x =up_counter, y=price_usd)) + geom_hex() + stat_smooth(color = "red")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Looking at the scatterplot we see a possible slight positive correlation, although the variability of the data makes it hard to affirm this prediction. To continue, we use the correlation test (since we are dealing with two continuous attributes) to check for a correlation between the number of up counts and the price of a vehicle.

The hypotheses for the correlation test are the following:

H0 =There does not exist a correlation between number of up counts and vehicle price

Ha = There does exist a correlation between number of up counts and vehicle price

We will use 0.05 as our significance level. The results are the following:

# UpCounterCorrelation
cor.test(cars_edited$up_counter, cars_edited$price_usd)
## 
##  Pearson's product-moment correlation
## 
## data:  cars_edited$up_counter and cars_edited$price_usd
## t = 11.352, df = 38488, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.04780740 0.06772124
## sample estimates:
##        cor 
## 0.05777007

Since the p-value is less than 0.05, we can conclude that price and number of up counts are correlated, with a correlation coefficient of 0.05777007 and a p-value of < 2.2e-16.

However, although they are correlated, the correlation coefficient is very close to 0, meaning the correlation is practically negligible. If one were to use this attribute as a predictor of price, the results would be poor.

Nevertheless, we will use linear regression to see how poor an indicator the number of up counts really is. We will check the percentage error of the regression and graph the regression line to provide useful insights.

# Create LM
up_counter_on_price <- lm (price_usd ~ up_counter, data = cars_edited)

#Finding how well this line fits the data
summary(up_counter_on_price)
## 
## Call:
## lm(formula = price_usd ~ up_counter, data = cars_edited)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -14558  -4502  -1852   2305  43438 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6493.0457    34.9345  185.86   <2e-16 ***
## up_counter     8.5694     0.7548   11.35   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6413 on 38488 degrees of freedom
## Multiple R-squared:  0.003337,   Adjusted R-squared:  0.003311 
## F-statistic: 128.9 on 1 and 38488 DF,  p-value: < 2.2e-16
# R^2 is extremely low which affirms that number of up counts is not a good indicator of price.

confint(up_counter_on_price)
##                   2.5 %     97.5 %
## (Intercept) 6424.573138 6561.51822
## up_counter     7.089888   10.04893
sigma(up_counter_on_price)*100/mean(cars_edited$price_usd)
## [1] 96.69137
# Our prediction error rate is extremely high (96.69137%) which confirms that up_counter is a terrible predictor of price (as the correlation test already suggested)

#Scatter plot: up counter and price with regression line
ggplot (cars_edited, aes(x=up_counter, y=price_usd)) + geom_point() + stat_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'

The R2 is 0.003311, which indicates that almost none of the variation in price can be explained by the model. In other words, the number of up counts is a very poor indicator of price (we need more attributes in the model). Also, the percentage error is 96.69, which is extremely high. This confirms how poor a model would be if it solely used the number of up counts to predict price.

We can conclude there is a negligible positive correlation between price and number of up counts.


Q6: Relationship between Engine Type and Body Type? What is the impact of Engine Type and Body Type on the selling price?

A: Sedan with gasoline is the most common combination, followed by hatchback with gasoline. Engine type and body type do have an impact on the selling price, but the extent of this impact will need to be investigated in our overall model.

Justification:

To begin this question, it was important to have a working knowledge of how body type and engine type relate to one another. Initially a mosaic plot felt like the right graph to show the relationship, but the quantity of categories made a balloon plot much easier to read. The spacious nature of the balloon plot greatly aided in gaining insights from the data and prompted further investigation.

# Balloon Plot
ggplot(cars_edited, aes(body_type, engine_type)) + geom_count()

Based on this balloon plot we can see that several combinations of body types and engine types do not exist. Furthermore, there seem to be higher quantities of hatchbacks and sedans with gasoline engines. Nevertheless, visualization does not suffice to prove any relationship, so we proceed with a Chi-Square test. The reason for a Chi-Square test is that we are analyzing the frequency table of two categorical variables (engine type and body type).

The hypotheses for the Chi-Square test are as follows:

H0 = Engine Type and Body Type are independent

Ha = Engine Type and Body Type are dependent

The test will be performed with 0.05 as our significance level. The results are the following:

#Chi-Square Test
engine_body.data <- table(cars_edited$body_type, cars_edited$engine_type)
chisq.test(engine_body.data)
## Warning in chisq.test(engine_body.data): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  engine_body.data
## X-squared = 6965.4, df = 22, p-value < 2.2e-16

Since the p-value (< 2.2e-16) is below 0.05, we reject the null hypothesis: the Chi-Square test tells us that Engine Type and Body Type are dependent.

We continue with a proportion table to give us an easy way of seeing the distribution of engine type and body type.

# Table
prop.table(engine_body.data)*100
##            
##                   diesel     electric     gasoline
##   cabriolet  0.010392310  0.000000000  0.184463497
##   coupe      0.080540400  0.000000000  1.613406080
##   hatchback  4.081579631  0.020784619 15.757339569
##   liftback   0.249415433  0.005196155  1.176929072
##   limousine  0.000000000  0.000000000  0.031176929
##   minibus    3.273577553  0.000000000  0.280592362
##   minivan    5.081839439  0.000000000  4.292023902
##   pickup     0.192257729  0.000000000  0.142894258
##   sedan      6.635489738  0.000000000 27.079760977
##   suv        4.359573915  0.000000000  9.051701741
##   universal  7.622759158  0.000000000  6.677058976
##   van        1.847233048  0.000000000  0.252013510

From the table we can clearly see that Hatchback and Gasoline as well as Sedan and Gasoline are the most common combinations of the two variables as we suspected from the balloon plot.

We continue by attempting to solve the second part of the question: what is the impact of Engine Type and Body Type on the selling price?

The most effective method for this question is the Two-Way Anova, since we are evaluating the simultaneous effect of two grouping variables (engine and body type) on a response variable (price).

The hypotheses for the Two-Way Anova test are:

H0 = 1) There is no difference in the means of Engine Type; 2) There is no difference in the means of Body Type; 3) There is no interaction between Engine Type and Body Type

Ha = For cases 1 and 2 the means are not equal, and for case 3 there is an interaction between Engine Type and Body Type

body_engine_type_on_price.aov <- aov(price_usd ~ engine_type * body_type, data = cars_edited)
summary(body_engine_type_on_price.aov)
##                          Df    Sum Sq   Mean Sq F value Pr(>F)    
## engine_type               2 1.301e+10 6.503e+09  204.12 <2e-16 ***
## body_type                11 3.457e+11 3.142e+10  986.32 <2e-16 ***
## engine_type:body_type    11 4.280e+09 3.891e+08   12.21 <2e-16 ***
## Residuals             38465 1.225e+12 3.186e+07                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p-value is less than 0.05 for all three cases we can confirm that there is a difference in the means of Engine Type, the means of Body Type, and there is an interaction between Engine and Body Type.

To conclude our investigation, we will run the Tukey Honest Significant Differences to perform the multiple pairwise-comparisons between the means of groups to determine which groups are different and how they differ.

# Result omitted for brevity's sake
#Tukey HSD
TukeyHSD(body_engine_type_on_price.aov)
## Fit: aov(formula = price_usd ~ engine_type * body_type, data = cars_edited)
## 
## $engine_type
##                         diff        lwr       upr p adj
## electric-diesel    10052.532   5867.612 14237.452 1e-07
## gasoline-diesel    -1175.359  -1318.298 -1032.420 0e+00
## gasoline-electric -11227.891 -15412.002 -7043.779 0e+00

The Tukey Test results can be evaluated to gain more detailed information on the relationship between price and Engine/Body Type.

These tests confirm that Engine Type and Body Type significantly impact the selling price of a vehicle.


Q8: What is the average age of vehicles from each manufacturer, and does the manufacturer change how the production year impacts the price?

A: The average age of vehicles from each manufacturer can be found with some data transformation. Furthermore, the manufacturer does influence how production year changes the price of a vehicle.

Justification:

Prior to working on this question, it is useful to have a visual understanding of our data. To accomplish this, we will use a hex-binned plot with facet_wrap to show the distribution of production years and prices for each manufacturer.

#Hex plot: price by year produced, faceted by manufacturer name
ggplot(cars_edited, aes(x = year_produced, y = price_usd)) + geom_hex() + facet_wrap(~ manufacturer_name)

Upon looking at the graph, visualization alone will not answer this question. Nevertheless, the facets follow a similar pattern: newer cars are more common, and they are more expensive.

To solve for the average age of each manufacturer we will perform a data transformation as follows:

#Group cars by manufacturer name
manufacturer_year <- group_by(cars_edited, manufacturer_name)

#Summarize the manufacturer years average
manufacturer_year_averages <- summarise(manufacturer_year, average = mean(year_produced, na.rm = TRUE))
# 1) Average age of each vehicle manufacturer
manufacturer_year_averages %>% arrange(desc(average))
## # A tibble: 55 x 2
##    manufacturer_name average
##    <chr>               <dbl>
##  1 Lifan               2015.
##  2 Buick               2014.
##  3 Geely               2014.
##  4 LADA                2014.
##  5 Skoda               2013.
##  6 Mini                2011.
##  7 Chevrolet           2011.
##  8 Chery               2011.
##  9 Great Wall          2009.
## 10 Dacia               2009.
## # ... with 45 more rows

Observing the results of the summary, the manufacturers with the newest average production years are Lifan, Buick, Geely, and LADA.

Next, we continue with the second part of the question: does the manufacturer change how the production year impacts the price?

To solve this question, we will again use a Two-Way ANOVA, because we are evaluating the simultaneous effect of Manufacturer and Year Produced on price.

The hypotheses for the Two-Way ANOVA test are:

Ho: 1. There is no difference in the means of Manufacturer. 2. There is no difference in the means of Year Produced. 3. There is no interaction between Manufacturer and Year Produced.

Ha: For cases 1 and 2, the means are not equal; for case 3, there is an interaction between Manufacturer and Year Produced.

For this test we will be using 0.05 as our level of significance. The results are thus:

# Do an Anova test to see if year produced significantly impacts price of a vehicle
manufacturer_price <- aov(price_usd ~ manufacturer_name * year_produced, data = cars_edited)
summary(manufacturer_price)
##                                    Df    Sum Sq   Mean Sq F value Pr(>F)    
## manufacturer_name                  54 2.917e+11 5.402e+09   428.4 <2e-16 ***
## year_produced                       1 6.854e+11 6.854e+11 54355.5 <2e-16 ***
## manufacturer_name:year_produced    54 1.273e+11 2.357e+09   186.9 <2e-16 ***
## Residuals                       38380 4.840e+11 1.261e+07                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In our results we can see that the p-value < 0.05 for each of the three cases. In other words, there is a difference in the means of Manufacturer, a difference in the means of Year Produced, and an interaction between Manufacturer and Year Produced. The manufacturer does change how the production year affects the selling price.
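A significant interaction term means the slope of price on production year differs by manufacturer. A minimal base-R sketch (synthetic data with two hypothetical "manufacturers", not our dataset) of how `lm(y ~ g * x)` encodes group-specific slopes:

```r
# Synthetic illustration: two hypothetical groups whose price rises with
# year at different rates; the g:x interaction captures the slope gap.
set.seed(1)
x <- rep(2000:2019, 2)                    # production year
g <- factor(rep(c("A", "B"), each = 20))  # hypothetical manufacturer
y <- ifelse(g == "A", 300, 600) * (x - 2000) + rnorm(40, sd = 50)

fit <- lm(y ~ g * x)
cf  <- coef(fit)

slope_A <- unname(cf["x"])               # slope for baseline group A
slope_B <- unname(cf["x"] + cf["gB:x"])  # group B adds the interaction term
round(c(A = slope_A, B = slope_B))
```

The estimated slopes recover the (made-up) per-group rates of roughly 300 and 600 USD per year, which is exactly the kind of manufacturer-dependent year effect the interaction term in the ANOVA detects.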

Now that we know that Manufacturer and Year Produced impact the price of a vehicle, it is important to see how good a model we can produce from these two attributes. To do this we will use a Multiple Linear Regression Model.

#Taking the linear regression
ManufyearPrice <- lm (price_usd ~ manufacturer_name + year_produced, data = cars_edited)

#Making sure the linear regression line matches the model
summary(ManufyearPrice)
## 
## Call:
## lm(formula = price_usd ~ manufacturer_name + year_produced, data = cars_edited)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -14133  -2147   -519   1134  44499 
## 
## Coefficients:
##                                  Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                    -1.140e+06  5.573e+03 -204.505  < 2e-16 ***
## manufacturer_nameAlfa Romeo    -5.591e+03  5.642e+02   -9.910  < 2e-16 ***
## manufacturer_nameAudi          -1.586e+03  4.978e+02   -3.187  0.00144 ** 
## manufacturer_nameAvtoVAZ       -3.893e+03  5.247e+02   -7.421 1.19e-13 ***
## manufacturer_nameBMW           -6.043e+02  4.972e+02   -1.215  0.22420    
## manufacturer_nameBuick         -4.103e+03  7.614e+02   -5.388 7.16e-08 ***
## manufacturer_nameCadillac      -1.036e+03  7.816e+02   -1.326  0.18499    
## manufacturer_nameChery         -1.024e+04  7.178e+02  -14.260  < 2e-16 ***
## manufacturer_nameChevrolet     -5.946e+03  5.268e+02  -11.287  < 2e-16 ***
## manufacturer_nameChrysler      -4.698e+03  5.291e+02   -8.879  < 2e-16 ***
## manufacturer_nameCitroen       -6.049e+03  5.013e+02  -12.068  < 2e-16 ***
## manufacturer_nameDacia         -8.635e+03  7.145e+02  -12.085  < 2e-16 ***
## manufacturer_nameDaewoo        -7.848e+03  5.596e+02  -14.023  < 2e-16 ***
## manufacturer_nameDodge         -5.002e+03  5.428e+02   -9.215  < 2e-16 ***
## manufacturer_nameFiat          -5.770e+03  5.105e+02  -11.301  < 2e-16 ***
## manufacturer_nameFord          -4.785e+03  4.974e+02   -9.621  < 2e-16 ***
## manufacturer_nameGAZ            2.130e+03  5.686e+02    3.746  0.00018 ***
## manufacturer_nameGeely         -8.969e+03  6.822e+02  -13.149  < 2e-16 ***
## manufacturer_nameGreat Wall    -7.699e+03  8.263e+02   -9.317  < 2e-16 ***
## manufacturer_nameHonda         -4.274e+03  5.109e+02   -8.366  < 2e-16 ***
## manufacturer_nameHyundai       -4.627e+03  5.052e+02   -9.159  < 2e-16 ***
## manufacturer_nameInfiniti       5.880e+02  5.824e+02    1.010  0.31262    
## manufacturer_nameIveco         -2.139e+02  5.963e+02   -0.359  0.71978    
## manufacturer_nameJaguar         4.243e+03  7.356e+02    5.768 8.06e-09 ***
## manufacturer_nameJeep          -4.869e+02  6.242e+02   -0.780  0.43541    
## manufacturer_nameKia           -4.973e+03  5.083e+02   -9.782  < 2e-16 ***
## manufacturer_nameLADA          -9.049e+03  5.918e+02  -15.290  < 2e-16 ***
## manufacturer_nameLancia        -5.546e+03  6.436e+02   -8.616  < 2e-16 ***
## manufacturer_nameLand Rover     2.458e+03  5.722e+02    4.295 1.75e-05 ***
## manufacturer_nameLexus          3.756e+03  5.618e+02    6.686 2.33e-11 ***
## manufacturer_nameLifan         -9.028e+03  7.615e+02  -11.856  < 2e-16 ***
## manufacturer_nameLincoln       -6.532e+02  8.264e+02   -0.791  0.42924    
## manufacturer_nameMazda         -4.846e+03  5.032e+02   -9.631  < 2e-16 ***
## manufacturer_nameMercedes-Benz -2.389e+02  4.983e+02   -0.479  0.63163    
## manufacturer_nameMini          -1.706e+03  6.892e+02   -2.475  0.01332 *  
## manufacturer_nameMitsubishi    -4.406e+03  5.090e+02   -8.655  < 2e-16 ***
## manufacturer_nameMoskvitch      4.373e+03  7.323e+02    5.972 2.37e-09 ***
## manufacturer_nameNissan        -4.452e+03  5.027e+02   -8.856  < 2e-16 ***
## manufacturer_nameOpel          -5.383e+03  4.969e+02  -10.832  < 2e-16 ***
## manufacturer_namePeugeot       -5.964e+03  4.994e+02  -11.942  < 2e-16 ***
## manufacturer_namePontiac       -4.690e+03  7.874e+02   -5.956 2.60e-09 ***
## manufacturer_namePorsche        5.437e+03  7.083e+02    7.676 1.68e-14 ***
## manufacturer_nameRenault       -5.894e+03  4.975e+02  -11.848  < 2e-16 ***
## manufacturer_nameRover         -5.704e+03  5.562e+02  -10.256  < 2e-16 ***
## manufacturer_nameSaab          -4.979e+03  6.233e+02   -7.988 1.41e-15 ***
## manufacturer_nameSeat          -5.163e+03  5.420e+02   -9.525  < 2e-16 ***
## manufacturer_nameSkoda         -2.145e+03  5.062e+02   -4.237 2.27e-05 ***
## manufacturer_nameSsangYong     -4.715e+03  6.650e+02   -7.090 1.37e-12 ***
## manufacturer_nameSubaru        -3.727e+03  5.438e+02   -6.854 7.28e-12 ***
## manufacturer_nameSuzuki        -5.732e+03  5.559e+02  -10.312  < 2e-16 ***
## manufacturer_nameToyota        -2.302e+03  5.037e+02   -4.570 4.90e-06 ***
## manufacturer_nameUAZ           -4.713e+03  6.756e+02   -6.977 3.06e-12 ***
## manufacturer_nameVolkswagen    -3.251e+03  4.949e+02   -6.570 5.10e-11 ***
## manufacturer_nameVolvo         -2.688e+03  5.129e+02   -5.240 1.61e-07 ***
## manufacturer_nameZAZ           -5.519e+03  7.877e+02   -7.007 2.47e-12 ***
## year_produced                   5.742e+02  2.766e+00  207.604  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3988 on 38434 degrees of freedom
## Multiple R-squared:  0.6152, Adjusted R-squared:  0.6146 
## F-statistic:  1117 on 55 and 38434 DF,  p-value: < 2.2e-16

The R-Squared is 0.6152, which indicates that a relatively large proportion of the variation in prices can be explained by the model. Although more attributes will aid in predicting prices, we can confirm that Manufacturer and Year Produced will be useful in our future models.
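As a reminder of what that R-Squared measures, it can be recomputed by hand as 1 − SSE/SST; a small base-R check on synthetic data:

```r
# R-squared by hand: proportion of the variation in y explained by the fit.
set.seed(42)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

fit <- lm(y ~ x)
sse <- sum(residuals(fit)^2)   # unexplained variation
sst <- sum((y - mean(y))^2)    # total variation
r2_manual <- 1 - sse / sst

# Matches the value reported by summary()
all.equal(r2_manual, summary(fit)$r.squared)
```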

To continue our investigation, we compute the confidence intervals for the coefficients and the prediction error rate as follows:

confint(ManufyearPrice)
##                                        2.5 %        97.5 %
## (Intercept)                    -1150550.3835 -1128705.4004
## manufacturer_nameAlfa Romeo       -6696.4389    -4484.9382
## manufacturer_nameAudi             -2562.1524     -610.8145
## manufacturer_nameAvtoVAZ          -4921.8093    -2865.0381
## manufacturer_nameBMW              -1578.8908      370.2152
## manufacturer_nameBuick            -5595.0018    -2610.2026
## manufacturer_nameCadillac         -2567.8861      495.8650
## manufacturer_nameChery           -11643.0998    -8829.2082
## manufacturer_nameChevrolet        -6978.9568    -4913.7873
## manufacturer_nameChrysler         -5735.4878    -3661.2753
## manufacturer_nameCitroen          -7031.8311    -5066.8047
## manufacturer_nameDacia           -10035.8648    -7234.8567
## manufacturer_nameDaewoo           -8944.9157    -6751.1196
## manufacturer_nameDodge            -6065.5282    -3937.7509
## manufacturer_nameFiat             -6770.2861    -4768.9920
## manufacturer_nameFord             -5759.9662    -3810.2375
## manufacturer_nameGAZ               1015.3929     3244.3169
## manufacturer_nameGeely           -10306.3586    -7632.2903
## manufacturer_nameGreat Wall       -9318.5557    -6079.3889
## manufacturer_nameHonda            -5275.4681    -3272.7342
## manufacturer_nameHyundai          -5617.0639    -3636.6834
## manufacturer_nameInfiniti          -553.4006     1729.4829
## manufacturer_nameIveco            -1382.5997      954.7819
## manufacturer_nameJaguar            2801.3329     5684.7834
## manufacturer_nameJeep             -1710.3565      736.6019
## manufacturer_nameKia              -5969.1345    -3976.4180
## manufacturer_nameLADA            -10208.9582    -7889.0083
## manufacturer_nameLancia           -6807.0860    -4283.9923
## manufacturer_nameLand Rover        1336.1772     3579.1920
## manufacturer_nameLexus             2654.8911     4857.2551
## manufacturer_nameLifan           -10520.7052    -7535.7338
## manufacturer_nameLincoln          -2272.9355      966.4434
## manufacturer_nameMazda            -5832.1261    -3859.6730
## manufacturer_nameMercedes-Benz    -1215.5909      737.7845
## manufacturer_nameMini             -3056.5676     -355.0096
## manufacturer_nameMitsubishi       -5403.6473    -3408.1937
## manufacturer_nameMoskvitch         2937.6360     5808.1012
## manufacturer_nameNissan           -5437.3948    -3466.6670
## manufacturer_nameOpel             -6356.7827    -4408.8060
## manufacturer_namePeugeot          -6943.2019    -4985.3917
## manufacturer_namePontiac          -6233.0210    -3146.4893
## manufacturer_namePorsche           4048.7994     6825.3766
## manufacturer_nameRenault          -6868.8037    -4918.7037
## manufacturer_nameRover            -6793.9860    -4613.8458
## manufacturer_nameSaab             -6200.6518    -3757.1769
## manufacturer_nameSeat             -6225.3147    -4100.5875
## manufacturer_nameSkoda            -3136.7430    -1152.4208
## manufacturer_nameSsangYong        -6018.5656    -3411.5550
## manufacturer_nameSubaru           -4792.8442    -2661.2810
## manufacturer_nameSuzuki           -6821.7444    -4642.6653
## manufacturer_nameToyota           -3289.2724    -1314.6324
## manufacturer_nameUAZ              -6037.5839    -3389.3607
## manufacturer_nameVolkswagen       -4221.5412    -2281.4538
## manufacturer_nameVolvo            -3693.2558    -1682.5705
## manufacturer_nameZAZ              -7063.3664    -3975.6282
## year_produced                       568.7353      579.5768
sigma(ManufyearPrice)*100/mean(cars_edited$price_usd)
## [1] 60.12396

Our prediction error rate (60.12396%) is lower than for other attributes, which confirms that Year Produced and Manufacturer are decent predictors of price. Still, an error rate above 60% tells us that prediction will require more attributes than these two alone. We conclude that Manufacturer and Year Produced do impact price and that the manufacturer does change how the production year affects the selling price.

Exchangeability Exploratory and Predictive Model

Exploring Exchangeability using a Decision Tree

To create a good exploratory model for exchangeability we will deploy a Decision Tree. The model is fit with the train function using 10-fold cross-validation and a tuneLength of 10 (the number of cp values to evaluate). These settings should prune our tree and ensure an optimal decision tree.

## Using Decision Tree to create an exploratory model for exchangeability
model_DT_Exchangeable <-  train(is_exchangeable ~ . -model_name, data = train.data, method = "rpart",
                                trControl = trainControl("cv",number = 10),
                                preProcess = c("center","scale"),
                                tuneLength = 10)
# Exploratory
predictionsDTExploratory <- predict(model_DT_Exchangeable, train.data)

# Check accuracy, error, and confusion matrix
accuracy <- mean(train.data$is_exchangeable == predictionsDTExploratory)
accuracy
# [1] 0.6939802
error <- mean(train.data$is_exchangeable != predictionsDTExploratory)
error
# [1] 0.3060198
confusionMatrix(train.data$is_exchangeable, predictionsDTExploratory)

The Decision Tree had an overall accuracy of around 69.4%, meaning its explanatory power is 69.4%. The misclassification rate was 30.6%.

The sensitivity, the accuracy when exchangeability is "TRUE", is ~63.5%.

The specificity, the accuracy when exchangeability is "FALSE", is ~70.6%.

The model precision, the proportion of positive predictions that are correct, is 31.16%.
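All four figures come from the 2×2 confusion matrix; a base-R sketch of how each is derived (hypothetical counts, not our model's output):

```r
# Hypothetical confusion matrix: rows = actual class, cols = predicted class
cm <- matrix(c(400, 130,    # actual TRUE : predicted TRUE, predicted FALSE
               230, 840),   # actual FALSE: predicted TRUE, predicted FALSE
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("TRUE", "FALSE"),
                             predicted = c("TRUE", "FALSE")))

TP <- cm["TRUE", "TRUE"];  FN <- cm["TRUE", "FALSE"]
FP <- cm["FALSE", "TRUE"]; TN <- cm["FALSE", "FALSE"]

accuracy    <- (TP + TN) / sum(cm)  # overall correct rate
sensitivity <- TP / (TP + FN)       # correct among actual TRUE
specificity <- TN / (TN + FP)       # correct among actual FALSE
precision   <- TP / (TP + FP)       # correct among predicted TRUE
round(c(accuracy, sensitivity, specificity, precision), 3)
```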

# Compute roc
predictionsDTExploratoryProb <- predict(model_DT_Exchangeable, train.data, type = "prob")
res.roc <- roc(train.data$is_exchangeable ~ predictionsDTExploratoryProb[,2])
plot.roc(res.roc, print.auc = TRUE)
as.numeric(res.roc$auc)
# [1] 0.6571428

# Get the probability threshold for specificity = 0.5
rocModelDT.data <- tibble(
  thresholds = res.roc$thresholds,
  sensitivity = res.roc$sensitivities,
  specificity = res.roc$specificities
)
rocModelDT.data %>% filter(specificity >= 0.5)
plot.roc(res.roc, print.auc = TRUE, print.thres = "best")

The AUC is above 0.5, which tells us that this is a useful exploratory model. The best threshold, maximizing the sum of sensitivity and specificity, is 0.357, giving a specificity of 0.818 and a sensitivity of 0.419.
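The "best" threshold reported by plot.roc maximizes sensitivity + specificity (Youden's J); a base-R sketch of the same search over candidate cutoffs, using synthetic scores rather than our model's probabilities:

```r
# Youden's J: pick the cutoff maximizing sensitivity + specificity.
set.seed(7)
truth <- rep(c(TRUE, FALSE), times = c(300, 700))
score <- ifelse(truth, rnorm(1000, mean = 0.6, sd = 0.2),
                       rnorm(1000, mean = 0.4, sd = 0.2))

cuts <- sort(unique(score))
sens <- sapply(cuts, function(t) mean(score[truth] >= t))  # true positives kept
spec <- sapply(cuts, function(t) mean(score[!truth] < t))  # true negatives kept
best <- which.max(sens + spec)

round(c(threshold = cuts[best], sensitivity = sens[best],
        specificity = spec[best]), 3)
```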

Predicting Exchangeability using a Decision Tree

To analyze the predictive power of our model we used the testing dataset.

# Predictive Model
predictionsDT <- predict(model_DT_Exchangeable, test.data)

# Check accuracy, error, and confusion matrix
accuracy <- mean(test.data$is_exchangeable == predictionsDT)
accuracy
# [1] 0.6871661
error <- mean(test.data$is_exchangeable != predictionsDT)
error
# [1] 0.3128339
confusionMatrix(test.data$is_exchangeable, predictionsDT)

The accuracy above tells us that the Decision Tree correctly classified ~69% of individuals on whether they would exchange their vehicle, which is better than random guessing. The misclassification error rate is ~31%.

Sensitivity is ~60.6%, the proportion of individuals correctly identified as willing to take part in an exchange of their vehicle.

The specificity of the model is around 70.4%, the proportion of individuals correctly identified as not willing to take part in an exchange of their vehicle.

The model precision, or proportion of positive predictions that are correct, is ~29.8%.

#Compute roc
predictionsDTProb <- predict(model_DT_Exchangeable, test.data, type = "prob")
res.roc <- roc(test.data$is_exchangeable ~ predictionsDTProb[,2])
plot.roc(res.roc, print.auc = TRUE)
as.numeric(res.roc$auc)
# [1] 0.6525616

# Get the probability threshold for specificity = 0.5
rocModelDT.data <- tibble(
  thresholds = res.roc$thresholds,
  sensitivity = res.roc$sensitivities,
  specificity = res.roc$specificities
)
rocModelDT.data %>% filter(specificity >= 0.5)
plot.roc(res.roc, print.auc = TRUE, print.thres = "best")

In this graph the AUC is 0.653, which is relatively good. Remember that a classifier that performs no better than chance is expected to have an AUC of 0.5 when evaluated on an independent test set not used to train the model. This model therefore does predict exchangeability with some success. The best threshold, maximizing the sum of sensitivity and specificity, is 0.332, giving a specificity of 0.772 and a sensitivity of 0.457. Ultimately, we learn that Decision Tree Classification produces a good predictive model for the exchangeability of a vehicle.

Exploring Exchangeability using Logistic Classification

We attempted to create a more accurate exploratory model for exchangeability by using Logistic Classification. The model is fit with the train function using 10-fold cross-validation. These settings should produce an optimal model.
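Under the hood, method = "glm" with family = "binomial" fits an ordinary logistic regression; a minimal base-R sketch of the same idea on synthetic data (hypothetical predictor, not our dataset):

```r
# Logistic classification by hand: fit, predict probabilities, classify at 0.5.
set.seed(123)
n        <- 500
odometer <- runif(n, 0, 300)                     # hypothetical predictor
exch     <- rbinom(n, 1, plogis(-1 + 0.01 * odometer))

fit  <- glm(exch ~ odometer, family = binomial)
phat <- predict(fit, type = "response")          # fitted probabilities
pred <- phat >= 0.5                              # classify at the 0.5 cutoff

accuracy <- mean(pred == (exch == 1))
accuracy
```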

## Using Logistic Classification to create an exploratory model for exchangeability
set.seed(123)
model_LR_Exchangeable <-  train( is_exchangeable ~ . -model_name, data = train.data, method = "glm", family = "binomial",
                                 trControl = trainControl("cv", number =10),
                                 preProcess = c("center", "scale"),
                                 tuneLength = 10
)

# Exploratory
predictionsLRExploratory <- predict(model_LR_Exchangeable, train.data)

# Check accuracy, error, and confusion matrix
accuracy <- mean(train.data$is_exchangeable == predictionsLRExploratory)
accuracy
# [1] 0.6624371       
error <- mean(train.data$is_exchangeable != predictionsLRExploratory)
error
# [1] 0.3375629       
confusionMatrix(train.data$is_exchangeable, predictionsLRExploratory)

The accuracy above tells us that the Logistic Classification correctly classified ~66% of individuals on whether they would exchange their vehicle, which is better than random guessing. The misclassification error rate is ~33.8%.

Sensitivity is ~58.5%, the proportion of individuals correctly identified as willing to take part in an exchange of their vehicle.

The specificity of the model is around 67%, the proportion of individuals correctly identified as not willing to take part in an exchange of their vehicle.

The model precision, or proportion of positive predictions that are correct, is ~14.85%.

# Compute roc
predictionsLRExploratoryProb <- predict(model_LR_Exchangeable, train.data, type = "prob")
res.roc <- roc(train.data$is_exchangeable ~ predictionsLRExploratoryProb[,2])
plot.roc(res.roc, print.auc = TRUE)
as.numeric(res.roc$auc)
# [1] 0.6505955

# Get the probability threshold for specificity = 0.5
rocModelLRExploratory.data <- tibble(
  thresholds = res.roc$thresholds,
  sensitivity = res.roc$sensitivities,
  specificity = res.roc$specificities
)
rocModelLRExploratory.data %>% filter(specificity >= 0.5)
plot.roc(res.roc, print.auc = TRUE, print.thres = "best")

In this graph the AUC is 0.651, which is relatively good and very close to the AUC from the Decision Tree model. Since the classifier performs better than 0.5, we can comfortably say this is a good exploratory model. The best threshold, maximizing the sum of sensitivity and specificity, is 0.356, giving a specificity of 0.640 and a sensitivity of 0.580. The Logistic Regression Model also produces a good exploratory model.

Predicting Exchangeability using Logistic Regression

## Using Logistic Regression to predict exchangeability
set.seed(123)
model_LR_Exchangeable <-  train( is_exchangeable ~ ., data = train.data, method = "glm", family = "binomial",
                                 trControl = trainControl("cv", number =10),
                                 preProcess = c("center", "scale"),
                                 tuneLength = 10
)

predictionsLR <- predict(model_LR_Exchangeable, test.data)
# Check accuracy, error, and confusion matrix
accuracy <- mean(test.data$is_exchangeable == predictionsLR)
accuracy
# [1] 0.6650163
error <- mean(test.data$is_exchangeable != predictionsLR)
error
# [1] 0.3349837
confusionMatrix(test.data$is_exchangeable, predictionsLR)

The accuracy above tells us that the Logistic Classification correctly classified ~66.5% of individuals on whether they would exchange their vehicle. This is a good accuracy for a predictive model. The misclassification error rate is ~33.5%.

Sensitivity is ~58%, the proportion of individuals correctly identified as willing to take part in an exchange of their vehicle.

The specificity of the model is around 67.3%, the proportion of individuals correctly identified as not willing to take part in an exchange of their vehicle.

The model precision, or proportion of positive predictions that are correct, is ~14.8%.

#Compute roc
predictionsLRProb <- predict(model_LR_Exchangeable, test.data, type = "prob")
res.rocLR <- roc(test.data$is_exchangeable ~ predictionsLRProb[,2])
plot.roc(res.rocLR, print.auc = TRUE)
as.numeric(res.rocLR$auc)
# [1] 0.6480223


# Get the probability threshold for specificity = 0.5
rocModelLR.data <- tibble(
  thresholds = res.rocLR$thresholds,
  sensitivity = res.rocLR$sensitivities,
  specificity = res.rocLR$specificities
)
rocModelLR.data %>% filter(specificity >= 0.5)
plot.roc(res.rocLR, print.auc = TRUE, print.thres = "best")

In this graph the AUC is 0.648, which is good for a model, so this model does predict exchangeability with some success. The best threshold, maximizing the sum of sensitivity and specificity, is 0.356, giving a specificity of 0.647 and a sensitivity of 0.580. The Logistic Regression Model also produces a good predictive model.

Models Accuracy Error Sensitivity Specificity Positive Prediction Value AUC
Decision Tree Classification: Exploratory Model 0.6939802 0.3060198 0.6349 0.7064 0.3116 0.6571428
Logistic Classification: Exploratory Model 0.6624371 0.3375629 0.58457 0.67010 0.14850 0.6505955
Decision Tree Classification: Prediction Model 0.6871661 0.3128339 0.6062 0.7039 0.2980 0.6525616
Logistic Classification: Prediction Model 0.6650163 0.3349837 0.58041 0.67329 0.14808 0.6480223

Model Results and Comparisons

To create the most accurate and significant predictive model, we created six different models with a wide range of accuracies.

Linear Regression Model

Firstly, we created a Multiple Linear Regression Model using the continuous attributes in our dataset and optimized it with stepwise selection. The code for this looks as follows:

set.seed(123)
LM <- lm(price_usd ~ odometer_value
         + year_produced
         + number_of_photos
         + duration_listed
         + up_counter, data = train.data)
step.LM <- LM %>% stepAIC(trace = FALSE)
vif(step.LM)
##   odometer_value    year_produced number_of_photos  duration_listed 
##         1.311552         1.375432         1.088560         1.985559 
##       up_counter 
##         1.995992

The VIF values tell us that there is no significant multicollinearity among the continuous attributes.
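For reference, the VIF for each predictor is 1/(1 − R²_j), where R²_j comes from regressing that predictor on all the others; a base-R sketch on synthetic predictors (no extra packages, values near 1 meaning little multicollinearity):

```r
# Manual VIF: regress each predictor on the rest; VIF_j = 1 / (1 - R2_j).
set.seed(99)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)   # mildly correlated with x1
x3 <- rnorm(n)              # independent of the rest
X  <- data.frame(x1, x2, x3)

vif_manual <- sapply(names(X), function(v) {
  others <- X[, setdiff(names(X), v), drop = FALSE]
  r2 <- summary(lm(X[[v]] ~ ., data = others))$r.squared
  1 / (1 - r2)
})
round(vif_manual, 2)
```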

Now that the model is created and optimized, it is vital to check its accuracy. To do this we will use test.data to see how accurately the Linear Regression model can predict the price of a vehicle.

summary(step.LM)
## 
## Call:
## lm(formula = price_usd ~ odometer_value + year_produced + number_of_photos + 
##     duration_listed + up_counter, data = train.data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -15620  -2427   -834   1352  46600 
## 
## Coefficients:
##                    Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)      -9.916e+05  7.305e+03 -135.756  < 2e-16 ***
## odometer_value   -4.419e-03  2.113e-04  -20.919  < 2e-16 ***
## year_produced     4.981e+02  3.639e+00  136.895  < 2e-16 ***
## number_of_photos  1.482e+02  4.251e+00   34.854  < 2e-16 ***
## duration_listed   2.242e+00  3.143e-01    7.134 9.98e-13 ***
## up_counter        2.253e+00  8.082e-01    2.787  0.00532 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4385 on 30809 degrees of freedom
## Multiple R-squared:  0.5304, Adjusted R-squared:  0.5303 
## F-statistic:  6959 on 5 and 30809 DF,  p-value: < 2.2e-16
coef(step.LM)
##      (Intercept)   odometer_value    year_produced number_of_photos 
##    -9.916423e+05    -4.419525e-03     4.981272e+02     1.481601e+02 
##  duration_listed       up_counter 
##     2.241819e+00     2.252599e+00
confint(step.LM)
##                          2.5 %        97.5 %
## (Intercept)      -1.005960e+06 -9.773249e+05
## odometer_value   -4.833614e-03 -4.005436e-03
## year_produced     4.909951e+02  5.052594e+02
## number_of_photos  1.398282e+02  1.564919e+02
## duration_listed   1.625863e+00  2.857775e+00
## up_counter        6.685618e-01  3.836636e+00

The information above is useful for later calculations and gives us a general idea of the stability of our coefficients.

LMPrediction <- predict(step.LM, test.data)

# Prediction error, rmse
RMSE(LMPrediction,test.data$price_usd)
## [1] 4573.511
# Compute R-square
R2(LMPrediction,test.data$price_usd)
## [1] 0.5095891

From the code above we see that our RMSE is 4573.511, which represents an error rate of 4573.511/mean(test.data$price_usd) * 100 ≈ 68.6%, which is poor. Meanwhile, the R-Squared is 0.5095891, meaning that the observed and predicted outcome values are not strongly correlated. These results are not surprising and inform us that the price of a vehicle in Belarus depends on more attributes than just our continuous ones. We shall proceed with more robust models to achieve a better result.
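The three comparison metrics used throughout this report can be wrapped as small helpers; a sketch with made-up prices and predictions (caret's RMSE and R2 compute the same quantities, with R2 as the squared correlation between predicted and observed values):

```r
# Helper sketch: RMSE, error rate (RMSE as % of mean price), and R-squared.
rmse_fn  <- function(pred, obs) sqrt(mean((pred - obs)^2))
err_rate <- function(pred, obs) rmse_fn(pred, obs) / mean(obs) * 100
r2_fn    <- function(pred, obs) cor(pred, obs)^2

# Tiny worked example with made-up prices and predictions
obs  <- c(5000, 8000, 12000, 3000)
pred <- c(5500, 7500, 11000, 3500)
round(c(RMSE = rmse_fn(pred, obs), error_rate = err_rate(pred, obs),
        R2 = r2_fn(pred, obs)), 4)
```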

Note: a logarithmic transformation of price was also tried and achieved an even worse result. Given the poor quality of this model, we continued with SVR to obtain a more robust and accurate predictive model.

SVR Model

SVR is an extremely robust model that can handle our categorical data, and it would almost certainly achieve a better result than the Linear Regression Model. Three SVR models were fit, with varying accuracies, using the linear, polynomial, and radial kernels.

Linear was the first SVR model to be run and the code was as follows:

# Create SVR Model using Linear Method
set.seed(123)
modelSVRLinTrain <- train( price_usd ~ ., data = train.data, method = "svmLinear",
                           trControl = trainControl("cv", number =10),
                           preProcess = c("center", "scale"),
                           tuneLength = 10
)

summary(modelSVRLinTrain)
#Length  Class   Mode 
#1   ksvm     S4 
modelSVRLinTrain$bestTune
#C
#1 1

The bestTune output shows that the best tuning parameter C, the value that maximizes our accuracy, is 1. We proceed by using the model to predict prices and comparing them to the actual test prices to gauge the accuracy of the model.

# Predict using SVR Model with Linear Method
modelSVRLinTrainPrediction <- predict(modelSVRLinTrain, test.data)

# Prediction error, rmse
RMSE(modelSVRLinTrainPrediction,test.data$price_usd)
#[1] 3257.453

# Compute R-square
R2(modelSVRLinTrainPrediction,test.data$price_usd)
#[1] 0.777241

Observing the RMSE (3257.453) we can see how concentrated the data is around our model. The error rate, 3257.453/mean(test.data$price_usd) * 100 = 48.86270%, is not great but significantly better than our Linear Regression Model. The R-Squared of 0.7772410 is also a significant increase in accuracy: around 77.7% of the variation in prices can be explained by the model. Nevertheless, we continue searching for better models:

The next model is the SVR model with a polynomial kernel. This model did not finish computing in time, and it would be worth evaluating given more time.

# Create SVR Model using Polynomial Method
set.seed(123)
modelSVRPolyTrain <- train(price_usd ~ ., data = train.data, method = "svmPoly",
                           trControl = trainControl("cv", number =10),
                           preProcess = c("center", "scale"),
                           tuneLength = 10
)

summary(modelSVRPolyTrain)
modelSVRPolyTrain$bestTune
# Predict using SVR Model with Linear Method
modelSVRPolyTrainPrediction <- predict(modelSVRPolyTrain, test.data)

# Prediction error, rmse
RMSE(modelSVRPolyTrainPrediction,test.data$price_usd)

# Compute R-square
R2(modelSVRPolyTrainPrediction,test.data$price_usd)

Lastly, the radial method was used with the SVR model:

# Create SVR Model using Radial Method
set.seed(123)
modelSVRRadialTrain <- train(price_usd ~ ., data = train.data, method = "svmRadial",
                             trControl = trainControl("cv", number =10),
                             preProcess = c("center", "scale"),
                             tuneLength = 10
)

summary(modelSVRRadialTrain)
#Length  Class   Mode
#1   ksvm     S4                     
modelSVRRadialTrain$bestTune
#sigma   C
#10 0.001556481 128 

The bestTune output shows sigma = 0.001556481 and C = 128, the tuning parameters that maximize our accuracy. We then used the model to predict vehicle prices and compared those predictions to the actual test prices to assess the accuracy of the model.

# Predict using SVR Model with Radial Method
modelSVRRadialTrainPrediction <- predict(modelSVRRadialTrain, test.data)

# Prediction error, rmse
RMSE(modelSVRRadialTrainPrediction,test.data$price_usd)
#[1] 4752.231

# Compute R-square
R2(modelSVRRadialTrainPrediction,test.data$price_usd)
#[1] 0.5937837

From the RMSE (4752.231) we calculated our error rate as 4752.231/mean(test.data$price_usd) * 100 = 71.28478%. This is even worse than the result from the Linear Regression Model. Not surprisingly, the R-Squared was also quite poor at 0.5937837: only around 59.4% of the variation in prices is explained by the model. These results convince us that radial basis functions are not the optimal kernel for SVR on our data.

Given this result, one might conclude that the data needs rescaling to improve the model; given more time it may be beneficial to rerun the model on a rescaled dataset to see if a better result could be achieved.

Even though the SVR models as a whole did much better than the Linear Regression Model, it was vital to investigate further models in search of still better accuracy. The Decision Tree was a perfect candidate since it is incredibly robust and relies on very few assumptions. Its simpler nature also makes it preferable to more complex models (such as Random Forest Regression) when performance is comparable.

Decision Tree Regression Model

We began our Decision Tree model by running train with 10-fold cross-validation and a tune-length of 10 (the number of cp values to evaluate), as with our SVR models. These settings prune the tree and help ensure an optimal decision tree.

set.seed(123)
model_DT_Train <- train(price_usd ~ ., data = train.data, method = "rpart",
                        trControl = trainControl("cv",number = 10),
                        preProcess = c("center","scale"),
                        tuneLength = 10)

summary(model_DT_Train)
#See summary(model_DT_Train)(2nd Run).txt
#For results
model_DT_Train$bestTune
#          cp
#1 0.01032955
plot(model_DT_Train)

Our value for bestTune was cp = 0.01032955, the complexity parameter that maximized our accuracy. The plot below shows the relationship between RMSE and the candidate cp values; one can easily see that the cp value was chosen to minimize the RMSE.

Next, we plot the final decision tree and print the decision rules for the final model.

# Plot the final tree model
par(xpd = NA) # Avoid clipping the text in some device
plot(model_DT_Train$finalModel)
text(model_DT_Train$finalModel, digits = 3)

#Decision rules in the model
model_DT_Train$finalModel
# See model_DT_TrainfinalModel-1.txt

Once the Decision Tree is created and pruned, we will then use it to predict values of our vehicle prices and analyze the accuracy of the model.

# Make predictions on the test data 
prediction_DT_Train <- model_DT_Train %>% predict(test.data)

# Prediction error, rmse
RMSE(prediction_DT_Train,test.data$price_usd)
#[1] 3245.413

# Compute R-square 
R2(prediction_DT_Train,test.data$price_usd) 
#[1] 0.7529956

Given the RMSE of 3245.413, we calculated our error rate as 3245.413/mean(test.data$price_usd) * 100 = 48.68209%. This is better than the Linear Regression Model as well as SVR with both linear and radial kernels. We also have a relatively good R-Squared at 0.7529956, better than every model so far except SVR with a linear kernel; it tells us that around 75.3% of the variance in price can be explained by our model. These results are good, but trying more models could not hurt. There is no guarantee that other models would perform better, but experimentation is essential in the search for a better model.

Given how well the Decision Tree model performed, it makes sense to try the Random Forest Tree Model next, since Random Forest Regression is a more sophisticated application of the Decision Tree that leverages many decision trees at once. In essence, one can expect a better result from the Random Forest; the question is whether the improvement is worth the added computational complexity.
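The ensemble idea described above can be sketched directly with the ranger package outside of caret. This is a minimal illustration only, assuming ranger is installed and train.data exists as in the workflow above:

```r
# Sketch only: fit a random forest directly with ranger (assumes the ranger
# package is installed and train.data exists as in the caret workflow above).
library(ranger)

rf_sketch <- ranger(
  price_usd ~ ., data = train.data,
  num.trees = 500   # each tree is grown on a bootstrap sample of train.data
)

# The forest's prediction is the average of the individual trees' predictions:
# predict(rf_sketch, test.data)$predictions
```

Averaging many decorrelated trees reduces the variance of a single tree, which is why a better result can be expected at the cost of extra computation.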

Random Forest Tree Model

The Random Forest Tree model was run with 10-fold cross-validation and a tune-length of 10 (the number of tuning-parameter combinations to evaluate), as with SVR and the Decision Tree. These settings should ensure an optimal Random Forest Tree Model.

set.seed(123)
random_forest_ranger <- train(price_usd ~ . ,
                              data = train.data,
                              method = "ranger",
                              trControl = trainControl("cv", number = 10),
                              preProcess = c("center","scale"),
                              tuneLength = 10
)

summary(random_forest_ranger)
#                          Length Class         Mode     
#predictions               30815  -none-        numeric  
#num.trees                     1  -none-        numeric  
#num.independent.variables     1  -none-        numeric  
#mtry                          1  -none-        numeric  
#min.node.size                 1  -none-        numeric  
#prediction.error              1  -none-        numeric  
#forest                        7  ranger.forest list     
#splitrule                     1  -none-        character
#num.random.splits             1  -none-        numeric  
#treetype                      1  -none-        character
#r.squared                     1  -none-        numeric  
#call                          9  -none-        call     
#importance.mode               1  -none-        character
#num.samples                   1  -none-        numeric  
#replace                       1  -none-        logical  
#xNames                     1215  -none-        character
#problemType                   1  -none-        character
#tuneValue                     3  data.frame    list     
#obsLevels                     1  -none-        logical  
#param                         0  -none-        list 

random_forest_ranger$finalModel
#Ranger result
#
#Call:
#  ranger::ranger(dependent.variable.name = ".outcome", data = x,      mtry = min(param$mtry, ncol(x)), min.node.size = param$min.node.size,      splitrule = as.character(param$splitrule), write.forest = TRUE,      probability = classProbs, ...) 
#
#Type:                             Regression 
#Number of trees:                  500 
#Sample size:                      30815 
#Number of independent variables:  1215 
#Mtry:                             1215 
#Target node size:                 5 
#Variable importance mode:         none 
#Splitrule:                        extratrees 
#Number of random splits:          1 
#OOB prediction error (MSE):       3137444 
#R squared (OOB):                  0.9233405 

Below is the plot for the Random Forest model. It shows model accuracy against the different combinations of tuning parameters evaluated by caret (mtry, the split rule, and the minimal node size).

Once the Random Forest Tree is made, we wish to gauge the accuracy of the model. To accomplish this, we use the predict function to predict values of our vehicle prices. Afterwards, we will compare these values to the actual prices of the test dataset to find the accuracy of the model.

# Make predictions on the test data
rf_predict_ranger <- predict(random_forest_ranger, test.data)

# Prediction error, rmse
RMSE(rf_predict_ranger,test.data$price_usd)
#[1] 1879.884

# Compute R-square
R2(rf_predict_ranger,test.data$price_usd)
#[1] 0.9184761

With an RMSE of 1879.884, we calculated our error rate as 1879.884/mean(test.data$price_usd) * 100 = 28.19878%. This is the best error rate so far. Unsurprisingly, we also have a great R-Squared at 0.9184761, telling us that around 91.8% of the variance in price can be explained by our model. These results are the best so far, but improvement may still be possible, so we continue with KNN to see whether an even better result can be achieved.

KNN Model

The KNN model was trained with 10-fold cross-validation and a tune-length of 20 (the number of k values to evaluate), again with centered and scaled predictors.

set.seed(123)
model_knn <- train(
  price_usd ~., data = train.data, method = "knn",
  trControl = trainControl("cv", number = 10),
  preProcess = c("center","scale"),
  tuneLength = 20
)

summary(model_knn$finalModel)
#Length Class      Mode     
#learn          2   -none-     list     
#k              1   -none-     numeric  
#theDots        0   -none-     list     
#xNames      1215   -none-     character
#problemType    1   -none-     character
#tuneValue      1   data.frame list     
#obsLevels      1   -none-     logical  
#param          0   -none-     list

# Print the best tuning parameter k that maximizes model accuracy
model_knn$bestTune
#k
#1 5

The bestTune output shows that k = 5 is the number of neighbors that maximizes model accuracy.

We continue our investigation of KNN by plotting the model accuracy of KNN relative to different values of k.

# Plot model accuracy vs different values of k
plot(model_knn)

The plot illustrates the relationship between RMSE and values of k. The value of k is chosen to minimize the RMSE and thus optimize the accuracy of the model.

With optimization of our KNN model complete, we now direct our attention to evaluating its accuracy. This is done by using the model to predict prices from our test dataset, which ensures the model is tested on data that is not in the training set and allows us to check the predictions against known values.

# Make predictions on the test data
knn_predictions <- model_knn %>% predict(test.data)
head(knn_predictions)
#[1] 9560.000 9000.000 4580.000 5648.494 8575.200 7720.000

# Compute the prediction error RMSE
RMSE(knn_predictions,test.data$price_usd)
#[1] 3693.127

# Compute R-square
R2(knn_predictions,test.data$price_usd)
#[1] 0.6802895

Given the RMSE of 3693.127, we calculated our error rate as 3693.127/mean(test.data$price_usd) * 100 = 55.39793%. This beats only Multiple Linear Regression and SVR with a radial kernel, making KNN our third-lowest-performing model. The R-Squared of 0.6802895 tells us that around 68.0% of the variance in price can be explained by our model.

Given the result of the KNN model, one can conclude that more work would be needed to optimize it. Since the predictors were already centered and scaled, further gains would likely require feature selection or a wider search over k; given more time it may be beneficial to rerun the model along those lines to see if a better result could be achieved.

Results Obtained

Machine-Learning Method          RMSE      Error Rate (%)  R-Square
Multiple Linear Regression       4470.274  67.57975        0.5190641
SVR Linear                       3257.453  48.86270        0.7772410
SVR Radial                       4752.231  71.28478        0.5937837
Decision Tree Regression         3245.413  48.68209        0.7529956
Random Forest Tree Regression    1879.884  28.19878        0.9184761
KNN (K-Nearest Neighbors)        3693.127  55.39793        0.6802895
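The comparison above can be assembled and ranked in R. This sketch simply re-enters the reported values rather than recomputing them from the fitted models:

```r
# Assemble the reported results into a data frame and rank the models.
results <- data.frame(
  Method    = c("Multiple Linear Regression", "SVR Linear", "SVR Radial",
                "Decision Tree Regression", "Random Forest Tree Regression",
                "KNN"),
  RMSE      = c(4470.274, 3257.453, 4752.231, 3245.413, 1879.884, 3693.127),
  ErrorRate = c(67.57975, 48.86270, 71.28478, 48.68209, 28.19878, 55.39793),
  RSquared  = c(0.5190641, 0.7772410, 0.5937837, 0.7529956, 0.9184761,
                0.6802895)
)

# Sorting by RMSE shows the Random Forest first on every metric.
results[order(results$RMSE), ]
```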

Conclusion

We have thoroughly analyzed our dataset with the tools we learned to use in our Data Analytics course. We can conclude that the asking price of a vehicle depends on many variables, and we were able to identify several correlations and trends. Some vehicles occur rarely because of limits in the data collected, so predictions for uncommon vehicles (specifically uncommon models) ought to be taken with the limitations of the models in mind. More testing is needed to delve further into our questions and optimize our models. As is always the case in data science, the process of managing data, finding new insights through analytics, and drawing conclusions will continue. As we find answers, we find new questions to ask, and that is the core of data science.